Xiang Yue
@xiangyue96
Postdoc @LTIatCMU. PhD from Ohio State @osunlp. Author of MMMU, MAmmoTH. Training & evaluating foundation models. Opinions are my own.
Demystifying Long CoT Reasoning in LLMs arxiv.org/pdf/2502.03373 Reasoning models like R1 / O1 / O3 have gained massive attention, but their training dynamics remain a mystery. We're taking a first deep dive into understanding long CoT reasoning in LLMs! 11 Major…

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
🚀 Thrilled to announce our new work: FR3E (First Return, Entropy-Eliciting Explore)! LLM reasoning with Reinforcement Learning often struggles with unstable and inefficient exploration. We propose FR3E, a structured framework to make it more robust & efficient.
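The tweet doesn't spell out the mechanism, but the name suggests token-level entropy is what picks where to explore. Below is a minimal sketch of that general idea, my reading rather than the paper's code: score positions in a sampled reasoning trace by predictive entropy, then branch extra rollouts from the most uncertain ones. The function name and the random logits are illustrative stand-ins.

```python
# Sketch of entropy-eliciting branch-point selection (an interpretation of
# the idea, not FR3E's actual implementation).
import torch
import torch.nn.functional as F

def entropy_branch_points(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    """logits: [seq_len, vocab] next-token logits along one sampled trajectory."""
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # per-token entropy
    # High entropy = the model is genuinely uncertain at this step, so
    # branching new rollouts from these positions explores where it matters
    # instead of restarting whole generations at random.
    return torch.topk(ent, k).indices.sort().values

logits = torch.randn(128, 32000)  # stand-in for real model logits
print(entropy_branch_points(logits))
```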
RL generalizes more broadly *because* it makes more specific, narrow representation updates, touching only a few embeddings relevant to the reasoning process. Unsurprising, but good job demonstrating the mechanism.
To understand why, we explored internal model representations. We fed domain-specific queries and answers into both base and fine-tuned models and performed PCA on hidden layer activations. We found that RL-tuned models exhibited minimal shifts in representation space across…
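A minimal sketch of that probing setup, assuming Hugging Face checkpoints and mean-pooled final-layer activations; the checkpoint names, pooling choice, and texts below are placeholders, not the paper's exact protocol.

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

BASE = "gpt2"   # stand-in: substitute the actual base checkpoint
TUNED = "gpt2"  # stand-in: substitute the RL-tuned checkpoint

def pooled_activations(model_name: str, texts: list[str], layer: int = -1):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    feats = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt", truncation=True)
            out = model(**ids, output_hidden_states=True)
            # Mean-pool the chosen layer's token activations into one vector.
            feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(feats).numpy()

texts = ["domain query/answer 1", "domain query/answer 2", "domain query/answer 3"]
base, tuned = pooled_activations(BASE, texts), pooled_activations(TUNED, texts)

# Fit PCA on the base model's activations and project both models into the
# same 2-D space; a small displacement indicates a minimal representation
# shift after tuning (zero here, since both stand-ins are the same model).
pca = PCA(n_components=2).fit(base)
shift = ((pca.transform(tuned) - pca.transform(base)) ** 2).sum(axis=1) ** 0.5
print("mean 2-D displacement:", shift.mean())
```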
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Our group is known for producing widely adopted benchmarks (MMMU, Mind2Web, TravelPlanner, ScienceAgentBench, etc.). Mind2Web 2 is probably the benchmark we've spent the most time on to date. 26 authors spent over 6 months tackling the emerging evaluation crisis head-on. Check it out!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…
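For intuition only, here is the "Agents-as-a-Judge" idea shrunk to a toy: a judge scores an agent's answer against a task rubric. Everything here (`RubricItem`, the `llm` callable, the rubric text) is hypothetical scaffolding, not Mind2Web 2's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    criterion: str  # e.g., "the answer cites a source page for each claim"
    weight: float

def judge(answer: str, sources: list[str], rubric: list[RubricItem],
          llm: Callable[[str], str]) -> float:
    """Ask a judge model whether each rubric item is satisfied;
    return a weighted score in [0, 1]."""
    earned = 0.0
    for item in rubric:
        prompt = (f"Criterion: {item.criterion}\n"
                  f"Answer: {answer}\nCited pages: {sources}\n"
                  "Is the criterion satisfied? Reply yes or no.")
        if llm(prompt).strip().lower().startswith("yes"):
            earned += item.weight
    return earned / sum(i.weight for i in rubric)

# Toy run with a dummy judge that always says yes.
rubric = [RubricItem("cites at least one source URL", 1.0)]
print(judge("The fee is $42 (see example.com).", ["example.com"], rubric,
            llm=lambda p: "yes"))
```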
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from…
Are you at #CVPR2025? RoboSpatial Oral is today! 📅 June 14 (Sat) | 🕐 1:00 PM | 📍Oral Session 4B @ ExHall A2
🔥 VLMs aren’t built for spatial reasoning — yet. They hallucinate free space. Misjudge object fit. Can’t tell below from behind. We built RoboSpatial to tackle that: a dataset for teaching spatial understanding to 2D/3D VLMs for robotics. 📝 Perfect review scores @CVPR 2025
Our VisualPuzzles🧩benchmark shows findings similar to "The Illusion of Thinking": - More tokens ≠ better reasoning - Reasoning models often underperform non-reasoning ones - Models collapse on harder puzzles, despite sounding "thoughtful" 🧠 - Longer traces = confusion, not…
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,…
[1/6] LLMs/VLMs aren't reliable planners—can they evaluate plans? 🤔 Our #CVPR2025 paper tests this in path planning. We find that VLMs show weak low-level perception & hallucinated reasoning. 📄 arxiv.org/abs/2411.18711 📊 huggingface.co/datasets/maghz… 📅 Fri Jun 13 4-6 PM @ ExHall D