Yifei Zhou
@YifeiZhou02
@xai | PhD student @berkeley_ai (on leave), working on RL for LLM agents, prev @AIatMeta, Opinions my own
Excited to share that our multi-turn interactions workshop has been accepted at NeurIPS!
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
It's been the most dramatic mindset shift since I paused my PhD at Berkeley and joined xAI: joining the effort to create the most intelligent AI model, with the most efficient team, operating at theoretically optimal speed. History is rolling forward and there is nothing in the way.
Introducing Grok 4, the world's most powerful AI model. Watch the livestream now: x.com/i/broadcasts/1…
Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside base LLM 💡? We answer these ⤵️ 🚨 arxiv.org/abs/2506.09026 🚨 matthewyryang.github.io/e3/
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️…
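The "extrapolates beyond training budget" claim can be made concrete with a small evaluation sketch. This is not the e3 code; `generate_answer`, `is_correct`, and the budgets below are placeholders I'm assuming for illustration.

```python
# Illustrative only (not the e3 code): measure accuracy as the test-time token
# budget grows past the budget used during RL training. "Extrapolation" means
# the curve keeps rising to the right of the training-time budget.

def accuracy_at_budget(model, problems, budget, generate_answer, is_correct):
    """`generate_answer` and `is_correct` are placeholders for inference and grading code."""
    correct = sum(is_correct(p, generate_answer(model, p, max_new_tokens=budget)) for p in problems)
    return correct / len(problems)

# e.g. evaluate at 0.5x, 1x, 2x, and 4x the training-time thinking budget:
# for budget in (4096, 8192, 16384, 32768):
#     print(budget, accuracy_at_budget(model, eval_problems, budget, generate, grade))
```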
Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
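For context on what is being bootstrapped here, a generic one-step Q-learning backup (not code from the post): the target uses the current Q estimate at the next state, and as I understand the post, the scaling worry is how bias in such bootstrapped targets accumulates over long horizons.

```python
import numpy as np

# Generic tabular one-step Q-learning backup (illustrative, not from the post):
# the target bootstraps on the current Q estimate at the next state, so any bias
# there flows back into Q[s, a] and can compound over many backups on
# long-horizon tasks.

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    target = r + bootstrap
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```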
🗞️ Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction TTI (Test-Time Interaction) is a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces…
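A toy sketch of the rollout-length curriculum as described above; the schedule and the Gym-style env interface are my assumptions, not the TTI code.

```python
# Toy sketch, not the TTI implementation: adaptively raise the per-episode
# interaction budget as training progresses, so the agent learns to act longer.

def rollout_length_schedule(iteration, start=4, max_steps=30, grow_every=50):
    """Hypothetical curriculum: allow a few more interaction steps every `grow_every` iterations."""
    return min(max_steps, start + iteration // grow_every)

def collect_rollout(env, policy, max_steps):
    """Assumes a Gym-style env; returns the trajectory used for the online RL update."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```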
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
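A rough sketch of that loop as described in the tweet; the function names are mine, not the SEAL API.

```python
# Rough sketch of the self-editing loop described above (names are mine, not the
# SEAL API): the model writes its own training data, a copy is fine-tuned on it,
# and the updated copy's downstream score is the RL reward for the self-edit
# generator.

def seal_style_step(model, new_input, eval_task, finetune_copy, evaluate, rl_optimizer):
    self_edit = model.generate_self_edit(new_input)   # model proposes its own training data
    updated_model = finetune_copy(model, self_edit)   # apply the self-edit as a weight update
    reward = evaluate(updated_model, eval_task)       # downstream performance after the update
    rl_optimizer.update(model, self_edit, reward)     # reinforce self-edits that helped
    return updated_model, reward
```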
A lot of work on agents these days uses reasoning RL to train agents. But is that good enough? @jackbai_jkb & @JunhongShen1 show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction!…
Very excited for this one. We took a cautiously experimental view on NN optimizers, aiming to find something that just works. SPlus matches Adam within ~44% of steps on a range of objectives. Please try it out in your setting, or read below for how it works.…
It’s been a really fun time working on this project. It turns out acting longer is the best axis for scaling inference compute for agents, but figuring out the best way to do it is tricky!
🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
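One way to picture the two scaling axes (my framing, with made-up numbers, not from the paper): spend a fixed inference budget either on more thinking tokens per step or on more interaction steps.

```python
# Illustrative comparison (numbers and helper are assumptions): the same total
# token budget can buy deeper per-step reasoning or more steps of interaction
# with the environment.

def run_agent(env, policy, n_steps, tokens_per_step):
    """Gym-style rollout where the policy gets a per-step reasoning budget."""
    obs = env.reset()
    info = {}
    for _ in range(n_steps):
        action = policy(obs, max_new_tokens=tokens_per_step)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return info

TOTAL_BUDGET = 32000                                                        # assumed total token budget
thinking_config = dict(n_steps=8, tokens_per_step=TOTAL_BUDGET // 8)        # think more per step
interaction_config = dict(n_steps=32, tokens_per_step=TOTAL_BUDGET // 32)   # act for more steps
```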
In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it's much better to scale the number of interactions! Check out how we did it!
🧵 1/7 Should AI agents "think more" or "do more"? 🤔 The current trend is to scale test-time compute, making agents generate longer reasoning traces. But what if that’s the wrong approach for interactive tasks? In our new work, we argue for a new scaling dimension: Test-Time…
I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise. Idle musings in my new blog post: sergeylevine.substack.com/p/language-mod…
Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer.io/) by @YifeiZhou02, further showing the promise behind such ideas.
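For readers who haven't seen ArCHer, the gist of the hierarchy (heavily simplified; the interfaces below are mine, not the ArCHer code): a high-level critic learns values over whole utterances/turns, and the token-level policy inside each turn is trained against that critic's estimates.

```python
# Heavily simplified sketch of a hierarchical update of this flavor: TD learning
# at the utterance level, policy-gradient-style updates at the token level using
# the utterance-level values as the learning signal.

def hierarchical_update(critic, policy, trajectory):
    # High level: temporal-difference backups over turns (utterances).
    for state, utterance, reward, next_state in trajectory:
        critic.td_update(state, utterance, reward, next_state)

    # Low level: train the token-level policy within each turn, using the critic's
    # utterance-level value (minus a baseline) as the advantage for that utterance.
    for state, utterance, _, _ in trajectory:
        advantage = critic.value(state, utterance) - critic.baseline(state)
        policy.reinforce(state, utterance, advantage)
```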
Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by @seohong_park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️
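One generic way to see why shrinking the horizon helps (an illustration of the general idea, not the paper's specific recipe): back values up over n-step chunks, so the number of bootstrapped recursions drops from roughly H to H/n.

```python
# Illustration of horizon reduction in general, not the paper's method: an n-step
# target sums n real rewards and bootstraps only once, so value errors compound
# over ~H/n backups instead of H.
# Convention: values has length len(rewards) + 1 (one value per state, incl. terminal).

def n_step_target(rewards, values, t, n, gamma=0.99):
    G = 0.0
    for k in range(n):
        if t + k >= len(rewards):       # episode ended before n steps
            return G
        G += (gamma ** k) * rewards[t + k]
    return G + (gamma ** n) * values[t + n]
```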
Self-Challenging LLM Agents Self-improving AI systems are starting to show up everywhere. Meta and colleagues present self-improvement for general multi-turn tool-use LLM agents. Pay attention to this one, devs! Here are my notes:
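The shape of the loop, as I read the summary (a generic sketch, not the paper's exact algorithm):

```python
# Generic self-improvement loop of the kind described above: the agent proposes
# its own tool-use tasks, attempts them over multiple turns, and keeps only
# trajectories that pass its own check as new training data.

def self_challenge_round(agent, tools, n_tasks=100):
    kept = []
    for _ in range(n_tasks):
        task = agent.propose_task(tools)          # agent invents a tool-use task + success check
        trajectory = agent.attempt(task, tools)   # multi-turn attempt with tool calls
        if task.check(trajectory):                # keep only verified successes
            kept.append(trajectory)
    agent.finetune(kept)                          # train on its own successful attempts
    return len(kept)
```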
Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓