Bo Liu (Benjamin Liu)
@Benjamin_eecs
RL PhD @NUSingapore | Intern @AIatMeta FAIR | Undergrad @PKU1898 | Building autonomous decision making system | Prev @deepseek_ai | DeepSeek-V2/VL/Prover SPIRAL
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…

We are launching a new benchmark for human-AI coordination 🙎🤖 in the Hanabi card game. 🎇🎆 If you are interested in developing methods that allow agents to collaborate with and support humans in complex, partially observable tasks, this is just for you. Crucially, we designed…
How can we build AI that can actually cooperate with humans? We are announcing the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) – a Hanabi benchmark designed to push the frontier of human-AI cooperation, accepted as a spotlight poster at @icmlconf 2025! 🧵👇
Learning GSPO proposed by Qwen team: fig 1. they propose to use sequence likelihood for importance sampling fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG fig 3. per-token IS in (Dr) GRPO is an approximation of it Am I missing anything?
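The distinction in the tweet can be made concrete in a few lines. Below is a minimal sketch (my own illustration, not Qwen's GSPO code; the log-prob values are made up) comparing the per-token importance ratios used in (Dr)GRPO-style objectives with the sequence-level ratio and its length-normalized form:

```python
import math

# Hypothetical per-token log-probs for one sampled response under the
# current policy (pi_theta) and the behavior policy (pi_old).
logp_new = [-1.2, -0.8, -2.1, -0.5]
logp_old = [-1.0, -0.9, -2.0, -0.6]

# Per-token importance ratios, as in (Dr)GRPO-style per-token clipping.
token_ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# Sequence-level ratio pi_theta(y|x) / pi_old(y|x): the product of the
# per-token ratios, i.e. the original off-policy PG importance weight.
seq_ratio = math.exp(sum(logp_new) - sum(logp_old))

# GSPO-style length-normalized sequence ratio: the geometric mean of the
# token ratios, which tames the variance of the raw product.
gspo_ratio = seq_ratio ** (1.0 / len(logp_new))
```

The sequence-level ratio equals the product of the token ratios exactly; the per-token form used in GRPO drops the cross-token coupling, which is the approximation the tweet is asking about.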
Pleased to share our Multi-Turn Interactions in LLMs workshop at NeurIPS 2025! …shop-multi-turn-interaction.github.io Welcome work on Multi-Turn RL, multi-turn human<->agent/agent<->agent/agent<->environment interactions, multi-turn tool use, multi-turn alignment, multi-turn evaluation,…
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
🤔Long-horizon tasks: How to train LLMs for the marathon?🌀 Submit anything on 🔁"Multi-turn Interactions in LLMs"🔁 to our @NeurIPSConf workshop by 08/22: 📕 Multi-Turn RL ⚖️ Multi-Turn Alignment 💬 Multi-Turn Human-AI Teaming 📊 Multi-Turn Eval ♾️You name it! #neurips #LLM
Very excited to share that an advanced version of Gemini Deep Think is the first to have achieved gold-medal level at the International Mathematical Olympiad 🏆, solving five out of six problems perfectly, as verified by the IMO organizers! It’s been a wild run to lead this…
Super thrilled to share that our AI has now reached silver-medalist level in Math at #imo2024 (1 point away from 🥇)! Since Jan, we now not only have a much stronger version of #AlphaGeometry, but also an entirely new system called #AlphaProof, capable of solving many more…
Join us at #NeurIPS2025 workshop to explore the future of multi-turn AI interactions! We welcome submissions on RL for agents, alignment, evaluation methods, and more.
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
ChatGPT can now do work for you using its own computer. Introducing ChatGPT agent—a unified agentic system combining Operator’s action-taking remote browser, deep research’s web synthesis, and ChatGPT’s conversational strengths.
We have exactly the same claim in our natural language RL paper arxiv.org/abs/2411.14251. I cannot believe even the phrases are so similar. I believe this is what differentiates traditional RL from learning from experience -- we should learn more from experience, not just from reward.
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…
Good blog on "era of exploration" - Data scarcity is the new bottleneck. LLMs consume data far faster than humans can produce it. We're running out of high-quality training data. - Pretraining solved exploration by accident. Pretraining effectively pays a massive, upfront…
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning Author's Explanation: x.com/Benjamin_eecs/… Overview: SPIRAL introduces a self-play framework for LLMs to develop reasoning skills through multi-turn, zero-sum games,…
More and better data for general reasoning beyond math? NaturalThoughts outperforms OpenThoughts3, LIMO, S1k, etc. on GPQA-Diamond, SuperGPQA, and MMLU Pro. Download 1M+ reasoning prompts: huggingface.co/datasets/faceb…
🌿Introducing NaturalThoughts 🌿 arxiv.org/abs/2507.01921 🎯 Data curation for general reasoning capabilities is still relatively underexplored. - We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in…
LLM + RL + Self-Play + Game = ♾ Infinity Possibility ♾ 💥New Paper Alert💥 🔗Paper: huggingface.co/papers/2506.24… 🔗Code: github.com/spiral-rl/spir…
The future of RL+LLM? Self-play. Why? Competitive scenarios offer: ✅ Built-in verification ✅ Automated curriculum learning ✅ Infinite complexity scaling Games prove this works for multi-turn, multi-agent systems. But the real potential? Extending beyond games to real-world…
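The "built-in verification" point above can be sketched in a few lines: in a zero-sum game, the reward comes from the game rules themselves, so self-play needs no human labels or learned reward model. A minimal toy illustration (my own, not the SPIRAL codebase) using rock-paper-scissors:

```python
import random

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def game_reward(a1, a2):
    """Player 1's reward, verified by the game rules alone -- no labels."""
    if a1 == a2:
        return 0
    return 1 if BEATS[a1] == a2 else -1

def self_play_step(policy):
    """One self-play episode: the same policy plays both sides."""
    a1, a2 = policy(), policy()
    r = game_reward(a1, a2)
    return r, -r  # zero-sum: the two rewards always cancel

# Example: a uniformly random policy playing against itself.
policy = lambda: random.choice(list(BEATS))
r1, r2 = self_play_step(policy)
```

Because the opponent is always the current policy, the difficulty of the opposition scales with the agent itself, which is the "automated curriculum" claim in the tweet.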
Facebook AI Research (FAIR) is a small, prestigious lab at Meta. We don't train large models like GenAI or MSL, so it's natural that we have limited GPUs. GenAI's or MSL's success or failure, past or future, doesn't reflect the work of FAIR. It is important to make this distinction.
No matter how AI evolves overnight—tech, career, how it may impact me—I remain committed to using "physics of language models" approach to predict next-gen AI. Due to my limited GPU access at Meta, Part 4.1 (+new 4.2) are still in progress, but results on Canon layers are shining
SPIRAL enables LLMs to learn sophisticated reasoning through self-play on zero-sum games, entirely without human supervision. This groundbreaking framework creates an infinite, self-improving curriculum for autonomous AI development.
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average.👇
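For readers unfamiliar with Kuhn Poker: it is a tiny three-card poker variant (each player antes 1 chip, then at most one bet of 1 chip), which makes self-play rollouts cheap and rewards trivially verifiable. A minimal sketch of its payoff rules (my own illustration, not the SPIRAL training code):

```python
CARDS = ["J", "Q", "K"]  # the full Kuhn Poker deck, ranked low to high

def payoff(c1, c2, history):
    """Player 1's chip payoff for a terminal betting history.

    Actions: 'p' = pass (check/fold), 'b' = bet/call (1 chip).
    Both players ante 1 chip before the history begins.
    """
    win = 1 if CARDS.index(c1) > CARDS.index(c2) else -1
    if history == "pp":                       # check-check: showdown for the antes
        return win
    if history in ("bb", "pbb"):              # bet called: showdown for 2 chips
        return 2 * win
    if history == "bp":                       # P1 bets, P2 folds: P1 wins the ante
        return 1
    if history == "pbp":                      # P1 checks, P2 bets, P1 folds
        return -1
    raise ValueError(f"non-terminal history: {history!r}")
```

With only 3 cards and 5 terminal histories, every reward is exactly computable, which is the "cheap, verifiable rewards" property the surrounding tweets emphasize.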
We are so back