Jiahao Qiu
@JiahaoQiu99
PhD @Princeton | Prev. Undergrad @SJTU1896 @UMichCSE
The GAIA game is over, and Alita is the final answer. Alita takes the top spot in GAIA, outperforming OpenAI Deep Research and Manus. Many general-purpose agents rely heavily on large-scale, manually predefined tools and workflows. However, we believe that for general AI…

We don’t have AI that self-improves yet, and when we do, it will be a game-changer. With more wisdom now than in the GPT-4 days, it's obvious that it will not be a “fast takeoff,” but rather extremely gradual over many years, probably a decade. The first thing to know is that…
🥍🥍Excited to share that "Collab: Controlled Decoding Using Mixture of Agents for AI Alignment" has been accepted at #ICLR2025 Q. How to provably combine multiple #expert #LLMs for a target task at #inferencetime ?? 💥 Collab More Details coming soon...
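A toy sketch of the idea behind combining expert LLMs at inference time: at each decoding step, every expert proposes a next-token distribution, and the token is chosen from the expert whose proposal scores highest under a task-specific value function. This is an illustrative simplification, not the actual Collab algorithm; `mixture_decode_step` and `value_fn` are hypothetical names.

```python
# Toy sketch of token-level controlled decoding with a mixture of
# expert policies (illustrative only; not the actual Collab method).
# Each expert proposes a next-token distribution; we pick the token
# from the expert whose top proposal scores highest under a
# task-specific value function.

def mixture_decode_step(expert_dists, value_fn):
    """expert_dists: list of {token: prob} dicts; returns chosen token."""
    best_token, best_score = None, float("-inf")
    for dist in expert_dists:
        token = max(dist, key=dist.get)        # this expert's top token
        score = dist[token] * value_fn(token)  # value-weighted probability
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Example: two experts; the value function prefers tokens aligned
# with the target task, steering decoding toward the second expert.
dists = [{"a": 0.6, "b": 0.4}, {"c": 0.9}]
chosen = mixture_decode_step(dists, lambda t: 2.0 if t == "c" else 1.0)
```

In the full setting, the value function would itself come from a learned reward or Q-function evaluated at inference time, with no retraining of the expert models.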
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
ReasonFlux-PRM-1.5B/7B: New trajectory-aware PRMs that evaluate how LLMs reason — not just what they output. ✅ Better data selection ✅ Stronger RL policy guidance ✅ Improved test-time scaling Paper: arxiv.org/abs/2506.18896 Code and Model: github.com/Gen-Verse/Reas…
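The core idea of a trajectory-aware PRM is scoring the intermediate reasoning steps rather than only the final answer; a trajectory-level score can then aggregate the per-step scores. A minimal sketch, assuming a hypothetical list of per-step scores (the real ReasonFlux-PRM models and interfaces may differ):

```python
# Minimal sketch of trajectory-aware process reward aggregation
# (illustrative; hypothetical per-step scores, not the actual
# ReasonFlux-PRM API). A PRM scores each intermediate reasoning
# step; a trajectory score aggregates them.

def score_trajectory(step_scores, agg="mean"):
    """Aggregate per-step PRM scores into one trajectory score."""
    if not step_scores:
        raise ValueError("trajectory has no steps")
    if agg == "mean":
        return sum(step_scores) / len(step_scores)
    if agg == "min":  # pessimistic: the weakest step dominates
        return min(step_scores)
    raise ValueError(f"unknown aggregator: {agg}")

# Example: best-of-N selection between two candidate trajectories,
# as used for data selection or test-time scaling. The "min"
# aggregator penalizes a single flawed intermediate step.
traj_a = [0.9, 0.8, 0.95]   # hypothetical per-step scores
traj_b = [0.9, 0.2, 0.95]   # one flawed intermediate step
best = max([traj_a, traj_b], key=lambda t: score_trajectory(t, "min"))
```

A final-answer-only reward would rate these two trajectories similarly; the step-level view is what separates them.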
Research with amazing collaborators @JizeJiang, @MeitangLi, and @JingchengYang, guided by great advisors and supported by the generous help of talented researchers @BowenJin13, @XingyuFu2, and many open-source contributors (easyr1, verl, vllm... etc).
Excited to introduce VTool-R1! We’ve trained VLMs to “think visually” using RL, blending Python-based 🖼️visual edits with💡textual Chain-of-Thought reasoning. Our trained Qwen2.5-VL-32B surpasses GPT-4o on ChartQA & TableVQA, and even the compact Qwen2.5-VL-7B significantly…
What is an agent? What is the optimal behavior for achieving a predefined goal? And how can that behavior policy be learned? We formally introduce a systematic Theory of Agent (ToA), analogous to the cognitive framework of Theory of Mind (ToM). Where ToM refers to the ability to…
Agent Distillation vs. LLM Distillation: Alita proposes agent distillation, a departure from the traditional distillation paradigm that is much cheaper and easier via automatic MCP generation! Our experiments show great improvement on the GAIA validation set through agent distillation.…

🚀Shallow alignment is now an important problem in LLM alignment. Dr. Xiangyu Qi first identified this problem in the field of safety alignment in "Safety Alignment Should Be Made More Than Just a Few Tokens Deep." 🌟Our newest research systematically and comprehensively validates…

📢📢We are organizing a workshop at #NeurIPS 2025 on Emergent Trust Risks in Large Reasoning Models, and we are inviting members to join our Program Committee. If you are interested in any topic related to LLM safety, we welcome your participation🤩🤩! forms.gle/wjoVyW1Hq1M9Sg…
It is interesting that, on the second day after I posted the tweet introducing Alita, the GAIA validation leaderboard was removed. GAIA Leaderboard: huggingface.co/spaces/gaia-be… RIP🕯️🕯️🕯️

I heard that someone refined their agent product using Alita’s paradigm in one day and achieved great performance. That is super cool. For more details on Alita: github.com/CharlesQ9/Alita Paper Link: arxiv.org/pdf/2505.20286
I just updated more discussions in github.com/CharlesQ9/Alita.
Using LLMs to build AI scientists is all the rage now (e.g., Google’s AI co-scientist [1] and Sakana’s Fully Automated Scientist [2]), but how much do we understand about their core scientific abilities? We know how LLMs can be vastly useful (solving complex math problems) yet…