Xiang Yue
@xiangyue96
Postdoc @LTIatCMU. PhD from Ohio State @osunlp. Author of MMMU, MAmmoTH. Training & evaluating foundation models. Opinions are my own.
Demystifying Long CoT Reasoning in LLMs arxiv.org/pdf/2502.03373 Reasoning models like R1 / O1 / O3 have gained massive attention, but their training dynamics remain a mystery. We're taking a first deep dive into understanding long CoT reasoning in LLMs! 11 Major…

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
🚀 Thrilled to announce our new work: FR3E (First Return, Entropy-Eliciting Explore)! LLM reasoning with Reinforcement Learning often struggles with unstable and inefficient exploration. We propose FR3E, a structured framework to make it more robust & efficient.
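The tweet doesn't spell out the mechanism, but the name suggests token-level entropy is what picks where to explore. Below is a minimal sketch of that general idea, my reading rather than the paper's code: score positions in a sampled reasoning trace by predictive entropy, then branch extra rollouts from the most uncertain ones. The function name and the random logits are illustrative stand-ins.

```python
# Sketch of entropy-eliciting branch-point selection (an interpretation of
# the idea, not FR3E's actual implementation).
import torch
import torch.nn.functional as F

def entropy_branch_points(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    """logits: [seq_len, vocab] next-token logits along one sampled trajectory."""
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # per-token entropy
    # High entropy = the model is genuinely uncertain at this step, so
    # branching new rollouts from these positions explores where it matters
    # instead of restarting whole generations at random.
    return torch.topk(ent, k).indices.sort().values

logits = torch.randn(128, 32000)  # stand-in for real model logits
print(entropy_branch_points(logits))
```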
RL generalizes more broadly *because* it makes more specific, narrow representation updates, touching only a few embeddings relevant to the reasoning process. Unsurprising, but good job demonstrating the mechanism.
To understand why, we explored internal model representations. We fed domain-specific queries and answers into both base and fine-tuned models and performed PCA on hidden layer activations. We found that RL-tuned models exhibited minimal shifts in representation space across…
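A minimal sketch of that probing setup, assuming Hugging Face checkpoints and mean-pooled final-layer activations; the checkpoint names, pooling choice, and texts below are placeholders, not the paper's exact protocol.

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

BASE = "gpt2"   # stand-in: substitute the actual base checkpoint
TUNED = "gpt2"  # stand-in: substitute the RL-tuned checkpoint

def pooled_activations(model_name: str, texts: list[str], layer: int = -1):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    feats = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True)
            out = model(**ids, output_hidden_states=True)
            # Mean-pool the chosen layer's token activations into one vector.
            feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(feats).numpy()

texts = ["domain query/answer 1", "domain query/answer 2", "domain query/answer 3"]
base, tuned = pooled_activations(BASE, texts), pooled_activations(TUNED, texts)

# Fit PCA on the base model's activations and project both models into the
# same 2-D space; a small displacement indicates a minimal representation
# shift after tuning (zero here, since both stand-ins are the same model).
pca = PCA(n_components=2).fit(base)
shift = ((pca.transform(tuned) - pca.transform(base)) ** 2).sum(axis=1) ** 0.5
print("mean 2-D displacement:", shift.mean())
```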
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Our group is known for producing widely adopted benchmarks (MMMU, Mind2Web, TravelPlanner, ScienceAgentBench, etc.). Mind2Web 2 is probably the benchmark we've spent the most time on to date. 26 authors spent over 6 months tackling the emerging evaluation crisis head-on. Check it out!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…
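For intuition only, here is the "Agents-as-a-Judge" idea shrunk to a toy: a judge scores an agent's answer against a task rubric. Everything here (`RubricItem`, the `llm` callable, the rubric text) is hypothetical scaffolding, not Mind2Web 2's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    criterion: str  # e.g., "the answer cites a source page for each claim"
    weight: float

def judge(answer: str, sources: list[str], rubric: list[RubricItem],
          llm: Callable[[str], str]) -> float:
    """Ask a judge model whether each rubric item is satisfied;
    return a weighted score in [0, 1]."""
    earned = 0.0
    for item in rubric:
        prompt = (f"Criterion: {item.criterion}\n"
                  f"Answer: {answer}\nCited pages: {sources}\n"
                  "Is the criterion satisfied? Reply yes or no.")
        if llm(prompt).strip().lower().startswith("yes"):
            earned += item.weight
    return earned / sum(i.weight for i in rubric)

# Toy run with a dummy judge that always says yes.
rubric = [RubricItem("cites at least one source URL", 1.0)]
print(judge("The fee is $42 (see example.com).", ["example.com"], rubric,
            llm=lambda p: "yes"))
```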
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from…
Are you at #CVPR2025? RoboSpatial Oral is today! 📅 June 14 (Sat) | 🕐 1:00 PM | 📍Oral Session 4B @ ExHall A2
🔥 VLMs aren’t built for spatial reasoning — yet. They hallucinate free space. Misjudge object fit. Can’t tell below from behind. We built RoboSpatial to tackle that: a dataset for teaching spatial understanding to 2D/3D VLMs for robotics. 📝 Perfect review scores @CVPR 2025
Our VisualPuzzles🧩benchmark shows findings similar to "The Illusion of Thinking": - More tokens ≠ better reasoning - Reasoning models often underperform non-reasoning ones - Models collapse on harder puzzles, despite sounding "thoughtful" 🧠 - Longer traces = confusion, not…
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,…
[1/6] LLMs/VLMs aren't reliable planners—can they evaluate plans? 🤔 Our #CVPR2025 paper tests this in path planning. We find that VLMs show weak low-level perception & hallucinated reasoning. 📄 arxiv.org/abs/2411.18711 📊 huggingface.co/datasets/maghz… 📅 Fri Jun 13 4-6 PM @ ExHall D