Kung-Hsiang Steeve Huang
@steeve__huang
Research Scientist @SFResearch | Formerly: PhD @UofIllinois, PhD Fellow @AmazonScience, MSc @USCViterbi, BEng @HKUST | He/him/his 🇹🇼 | #NLP
Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.…
Vision Language Models (VLMs) are great at many things, but they often fumble when it comes to simple visual arithmetics like counting or comparing lengths, hindering their understanding of charts 📈 and geometry 📐. Our new paper explores why this happens 🧐 and discover the…
User simulators bridge RL with real-world interaction // jessylin.com/2025/07/10/use… How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve…
📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI! 👉 …vision-language-embodied-ai.github.io 🦾Co-organized with an incredible team → @fredahshi · @maojiayuan · @DJiafei · @ManlingLi_ · David Hsu · @Kordjamshidi 🌌 Why Space & SpaVLE? We…
🚀 Excited to share our work led by my amazing labmate @zhenhailongW, PAPO: Perception-Aware Policy Optimization, an extension of GRPO for multimodal reasoning! No extra labels. No reward models. Just internal supervision. 🔥 Learning to perceive while learning to reason.
Learning to perceive while learning to reason! We introduce PAPO: Perception-Aware Policy Optimization, a direct upgrade to GRPO for multimodal reasoning. PAPO relies on internal supervision signals. No extra annotations, reward models, or teacher models needed. 🧵1/3
Learning to perceive while learning to reason! We introduce PAPO: Perception-Aware Policy Optimization, a direct upgrade to GRPO for multimodal reasoning. PAPO relies on internal supervision signals. No extra annotations, reward models, or teacher models needed. 🧵1/3
Now accepted to @COLM_conf 🤩 Super excited for Montreal 🇨🇦🍁 This also marks my third successful collaboration with my good friend @PhilippeLaban
Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this we train reward models on expert edits that beat SOTA #LLMs largely on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time boosting alignment with experts.
I've successfully defended my PhD thesis on automated information seeking! Extremely grateful to my advisor @hengjinlp, committee members and all collaborators. Next, I'll be joining @GoogleDeepMind as a research scientist! Link to defense slides: docs.google.com/presentation/d…
PhD #24 - Congratulations to Dr. Revanth Reddy @gangi_official on successfully defending his amazing PhD thesis and joining Google DeepMind as a research scientist! Many thanks to my friends and collaborators for co-advising him in the past several years!
🧠 How can AI evolve from statically 𝘵𝘩𝘪𝘯𝘬𝘪𝘯𝘨 𝘢𝘣𝘰𝘶𝘵 𝘪𝘮𝘢𝘨𝘦𝘴 → dynamically 𝘵𝘩𝘪𝘯𝘬𝘪𝘯𝘨 𝘸𝘪𝘵𝘩 𝘪𝘮𝘢𝘨𝘦𝘴 as cognitive workspaces, similar to the human mental sketchpad? 🔍 What’s the 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗿𝗼𝗮𝗱𝗺𝗮𝗽 from tool-use → programmatic…
🔍Deep Search ≠ Deep Research. It’s not about browsing, insight mining, coding, or report writing—it’s about retrieving signal from messy, scattered data: GDocs, Slack, Meeting, GitHub, OrgCharts, etc. Agents must reason across it all, know what to search and where to search!
🧪HERB - a benchmark that puts RAG systems to the test with real enterprise challenges! 📊 Even our best agentic RAG systems only hit 30% accuracy when dealing with scattered info across Slack, GitHub, docs & meetings 🔍 Key finding: Retrieval is the main bottleneck, not…
#ICML #cognition #GrowAI We spent 2 years carefully curated every single experiment (i.e. object permanence, A-not-B task, visual cliff task) in this dataset (total: 1503 classic experiments spanning 12 core cognitive concepts). We spent another year to get 230 MLLMs evaluated…
🧪HERB - a benchmark that puts RAG systems to the test with real enterprise challenges! 📊 Even our best agentic RAG systems only hit 30% accuracy when dealing with scattered info across Slack, GitHub, docs & meetings 🔍 Key finding: Retrieval is the main bottleneck, not…
✅ To appear at TMLR! Camera-ready version coming soon, with new experiments and additional discussions! As LLMs are increasingly used for creative writing or scientific idea generation, this shared imagination may prove to be a fundamental limitation on their effectiveness.
(1/12) Can different LLMs give you unique and novel ideas? Very likely NO! 🤖 "𝗦𝗵𝗮𝗿𝗲𝗱 𝗜𝗺𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻: 𝗟𝗟𝗠𝘀 𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗲 𝗔𝗹𝗶𝗸𝗲" reveals: LLMs often 𝗮𝗴𝗿𝗲𝗲 on purely imaginary and hallucinated contents! Explore 🧵or full paper:…
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from…
4/ I’m actually bullish medium term involving AI in customer experience. But IT depts must educate themselves. The details on CRMArenaPro and the gap between LLMs / enterprise CRM needs in a major new paper by @SFResearch’s @steeve__huang + team: arxiv.org/abs/2505.18878
arxiv.org/abs/2505.18878 Salesforce tried LLMs in Real Business Scenarios and Found Disappointing Performance Even from the Best
Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizing sensitive information and not adhering to appropriate data handling protocols: theregister.com/2025/06/16/sal… paper: arxiv.org/abs/2505.18878
🚀 I'm looking for full-time research scientist jobs on foundation models! I study pre-training and post-training of foundation models, and LLM-based coding agents. The figure highlights my research/publications. Please DM me if there is any good fit! Highly appreciated!
1/10🎉New paper on AI Agent and LLM judge safety "Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows" As AI agents become increasingly autonomous, they often rely on feedback from judges (evaluators). These judges evaluate, critique, and…
Another paper drop, this time from Salesforce: "These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting needs for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition."
Gary, this research paper from Salesforce flew under the radar a bit. Even with flagship models like o1, customer service agents fail 65% of multi-turn tasks. There is a similar paper out from Microsoft. Both from May of this year. reddit.com/r/BetterOfflin…