Kung-Hsiang Steeve Huang

@steeve__huang

Research Scientist @SFResearch | Formerly: PhD @UofIllinois, PhD Fellow @AmazonScience, MSc @USCViterbi, BEng @HKUST | He/him/his 🇹🇼 | #NLP

Champaign, IL

Joined November 2017

283Following

1KFollowers

Pinned

Kung-Hsiang Steeve Huang@steeve__huang · May 17

Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.…

KKung-Hsiang Steeve Huang@steeve__huang · Feb 18

Vision Language Models (VLMs) are great at many things, but they often fumble when it comes to simple visual arithmetics like counting or comparing lengths, hindering their understanding of charts 📈 and geometry 📐. Our new paper explores why this happens 🧐 and discover the…

5.0K

Kung-Hsiang Steeve Huang Retweeted

Jessy Lin@realJessyLin · Jul 10

User simulators bridge RL with real-world interaction // jessylin.com/2025/07/10/use… How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve…

326

279

32.0K

Kung-Hsiang Steeve Huang Retweeted

Martin Ziqiao Ma@ziqiao_ma · Jul 10

📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI! 👉 …vision-language-embodied-ai.github.io 🦾Co-organized with an incredible team → @fredahshi · @maojiayuan · @DJiafei · @ManlingLi_ · David Hsu · @Kordjamshidi 🌌 Why Space & SpaVLE? We…

8.0K

Kung-Hsiang Steeve Huang@steeve__huang · Jul 10

🚀 Excited to share our work led by my amazing labmate @zhenhailongW, PAPO: Perception-Aware Policy Optimization, an extension of GRPO for multimodal reasoning! No extra labels. No reward models. Just internal supervision. 🔥 Learning to perceive while learning to reason.

ZZhenhailong Wang@zhenhailongW · Jul 10

Learning to perceive while learning to reason! We introduce PAPO: Perception-Aware Policy Optimization, a direct upgrade to GRPO for multimodal reasoning. PAPO relies on internal supervision signals. No extra annotations, reward models, or teacher models needed. 🧵1/3

499

Kung-Hsiang Steeve Huang Retweeted

Zhenhailong Wang@zhenhailongW · Jul 10

2.0K

Kung-Hsiang Steeve Huang Retweeted

AK@_akhaliq · Jul 10

Perception-Aware Policy Optimization for Multimodal Reasoning

190

71.0K

Kung-Hsiang Steeve Huang@steeve__huang · Jul 8

Now accepted to @COLM_conf 🤩 Super excited for Montreal 🇨🇦🍁 This also marks my third successful collaboration with my good friend @PhilippeLaban

TTuhin Chakrabarty@TuhinChakr · Apr 21

Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this we train reward models on expert edits that beat SOTA #LLMs largely on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time boosting alignment with experts.

4.0K

Kung-Hsiang Steeve Huang Retweeted

Revanth Gangi Reddy@gangi_official · Jul 3

I've successfully defended my PhD thesis on automated information seeking! Extremely grateful to my advisor @hengjinlp, committee members and all collaborators. Next, I'll be joining @GoogleDeepMind as a research scientist! Link to defense slides: docs.google.com/presentation/d…

4.0K

Kung-Hsiang Steeve Huang Retweeted

Heng Ji@hengjinlp · Jul 3

PhD #24 - Congratulations to Dr. Revanth Reddy @gangi_official on successfully defending his amazing PhD thesis and joining Google DeepMind as a research scientist! Many thanks to my friends and collaborators for co-advising him in the past several years!

114

8.0K

Kung-Hsiang Steeve Huang Retweeted

May Fung@May_F1_ · Jul 2

🧠 How can AI evolve from statically 𝘵𝘩𝘪𝘯𝘬𝘪𝘯𝘨 𝘢𝘣𝘰𝘶𝘵 𝘪𝘮𝘢𝘨𝘦𝘴 → dynamically 𝘵𝘩𝘪𝘯𝘬𝘪𝘯𝘨 𝘸𝘪𝘵𝘩 𝘪𝘮𝘢𝘨𝘦𝘴 as cognitive workspaces, similar to the human mental sketchpad? 🔍 What’s the 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗿𝗼𝗮𝗱𝗺𝗮𝗽 from tool-use → programmatic…

179

116

13.0K

Kung-Hsiang Steeve Huang@steeve__huang · Jul 1

🔍Deep Search ≠ Deep Research. It’s not about browsing, insight mining, coding, or report writing—it’s about retrieving signal from messy, scattered data: GDocs, Slack, Meeting, GitHub, OrgCharts, etc. Agents must reason across it all, know what to search and where to search!

SSalesforce AI Research@SFResearch · Jul 1

🧪HERB - a benchmark that puts RAG systems to the test with real enterprise challenges! 📊 Even our best agentic RAG systems only hit 30% accuracy when dealing with scattered info across Slack, GitHub, docs & meetings 🔍 Key finding: Retrieval is the main bottleneck, not…

1.0K

Kung-Hsiang Steeve Huang Retweeted

Hokin Deng@DengHokin · Jun 29

#ICML #cognition #GrowAI We spent 2 years carefully curated every single experiment (i.e. object permanence, A-not-B task, visual cliff task) in this dataset (total: 1503 classic experiments spanning 12 core cognitive concepts). We spent another year to get 230 MLLMs evaluated…

539

578

107.0K

Kung-Hsiang Steeve Huang Retweeted

Salesforce AI Research@SFResearch · Jul 1

3.0K

Kung-Hsiang Steeve Huang@steeve__huang · Jun 26

✅ To appear at TMLR! Camera-ready version coming soon, with new experiments and additional discussions! As LLMs are increasingly used for creative writing or scientific idea generation, this shared imagination may prove to be a fundamental limitation on their effectiveness.

SSalesforce AI Research@SFResearch · Jul 24

(1/12) Can different LLMs give you unique and novel ideas? Very likely NO! 🤖 "𝗦𝗵𝗮𝗿𝗲𝗱 𝗜𝗺𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻: 𝗟𝗟𝗠𝘀 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗲 𝗔𝗹𝗶𝗸𝗲" reveals: LLMs often 𝗮𝗴𝗿𝗲𝗲 on purely imaginary and hallucinated contents! Explore 🧵or full paper:…

732

Kung-Hsiang Steeve Huang Retweeted

Martin Ziqiao Ma@ziqiao_ma · Jun 24

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from…

100

16.0K

Kung-Hsiang Steeve Huang Retweeted

Dion Hinchcliffe@dhinchcliffe · Jun 17

4/ I’m actually bullish medium term involving AI in customer experience. But IT depts must educate themselves. The details on CRMArenaPro and the gap between LLMs / enterprise CRM needs in a major new paper by @SFResearch’s @steeve__huang + team: arxiv.org/abs/2505.18878

339

Kung-Hsiang Steeve Huang Retweeted

Dr. Theophano Mitsa ☦️🇬🇷🇺🇸@theomitsa · Jun 16

arxiv.org/abs/2505.18878 Salesforce tried LLMs in Real Business Scenarios and Found Disappointing Performance Even from the Best

478

Kung-Hsiang Steeve Huang Retweeted

elvis@omarsar0 · Jun 16

Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizing sensitive information and not adhering to appropriate data handling protocols: theregister.com/2025/06/16/sal… paper: arxiv.org/abs/2505.18878

5.0K

Kung-Hsiang Steeve Huang Retweeted

Yangyi Chen (on job market)@YangyiChen6666 · Jun 15

🚀 I'm looking for full-time research scientist jobs on foundation models! I study pre-training and post-training of foundation models, and LLM-based coding agents. The figure highlights my research/publications. Please DM me if there is any good fit! Highly appreciated!

128

17.0K

Kung-Hsiang Steeve Huang Retweeted

Salesforce AI Research@SFResearch · Jun 11

1/10🎉New paper on AI Agent and LLM judge safety "Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows" As AI agents become increasingly autonomous, they often rely on feedback from judges (evaluators). These judges evaluate, critique, and…

2.0K

Kung-Hsiang Steeve Huang@steeve__huang · Jun 10

Another paper drop, this time from Salesforce: "These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting needs for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition."

TTim Hot Tub@inteldankweb · Jun 9

Gary, this research paper from Salesforce flew under the radar a bit. Even with flagship models like o1, customer service agents fail 65% of multi-turn tasks. There is a similar paper out from Microsoft. Both from May of this year. reddit.com/r/BetterOfflin…

103

511

296

49.0K