Yifei Zhou
@YifeiZhou02
@xai | PhD student @berkeley_ai (on leave), working on RL for LLM agents, prev @AIatMeta, Opinions my own
Excited to share that our multi-turn interactions workshop has been accepted at NeurIPS!
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
It's been the most dramatic mindset shift since I paused my PhD at Berkeley and joined xAI: joining the effort to create the most intelligent AI model, with the most efficient team, operating at theoretically optimal speed. History is rolling forward and there is nothing in the way.
Introducing Grok 4, the world's most powerful AI model. Watch the livestream now: x.com/i/broadcasts/1…
Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside base LLM 💡? We answer these ⤵️ 🚨 arxiv.org/abs/2506.09026 🚨 matthewyryang.github.io/e3/
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️…
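The "extrapolates beyond training budget" claim can be made concrete with a small evaluation sketch. This is not the e3 code; `generate_answer`, `is_correct`, and the budgets below are placeholders I'm assuming for illustration.

```python
# Illustrative only (not the e3 code): measure accuracy as the test-time token
# budget grows past the budget used during RL training. "Extrapolation" means
# the curve keeps rising to the right of the training-time budget.

def accuracy_at_budget(model, problems, budget, generate_answer, is_correct):
    """`generate_answer` and `is_correct` are placeholders for inference and grading code."""
    correct = sum(is_correct(p, generate_answer(model, p, max_new_tokens=budget)) for p in problems)
    return correct / len(problems)

# e.g. evaluate at 0.5x, 1x, 2x, and 4x the training-time thinking budget:
# for budget in (4096, 8192, 16384, 32768):
#     print(budget, accuracy_at_budget(model, eval_problems, budget, generate, grade))
```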
Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
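For context on what is being bootstrapped here, a generic one-step Q-learning backup (not code from the post): the target uses the current Q estimate at the next state, and as I understand the post, the scaling worry is how bias in such bootstrapped targets accumulates over long horizons.

```python
import numpy as np

# Generic tabular one-step Q-learning backup (illustrative, not from the post):
# the target bootstraps on the current Q estimate at the next state, so any bias
# there flows back into Q[s, a] and can compound over many backups on
# long-horizon tasks.

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    target = r + bootstrap
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```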
🗞️ Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction TTI (Test-Time Interaction) is a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces…
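A toy sketch of the rollout-length curriculum as described above; the schedule and the Gym-style env interface are my assumptions, not the TTI code.

```python
# Toy sketch, not the TTI implementation: adaptively raise the per-episode
# interaction budget as training progresses, so the agent learns to act longer.

def rollout_length_schedule(iteration, start=4, max_steps=30, grow_every=50):
    """Hypothetical curriculum: allow a few more interaction steps every `grow_every` iterations."""
    return min(max_steps, start + iteration // grow_every)

def collect_rollout(env, policy, max_steps):
    """Assumes a Gym-style env; returns the trajectory used for the online RL update."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```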
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
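A rough sketch of that loop as described in the tweet; the function names are mine, not the SEAL API.

```python
# Rough sketch of the self-editing loop described above (names are mine, not the
# SEAL API): the model writes its own training data, a copy is fine-tuned on it,
# and the updated copy's downstream score is the RL reward for the self-edit
# generator.

def seal_style_step(model, new_input, eval_task, finetune_copy, evaluate, rl_optimizer):
    self_edit = model.generate_self_edit(new_input)   # model proposes its own training data
    updated_model = finetune_copy(model, self_edit)   # apply the self-edit as a weight update
    reward = evaluate(updated_model, eval_task)       # downstream performance after the update
    rl_optimizer.update(model, self_edit, reward)     # reinforce self-edits that helped
    return updated_model, reward
```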
A lot of work on agents these days uses reasoning RL to train agents. But is that good enough? @jackbai_jkb & @JunhongShen1 show that it's not: we also want RL to learn *how* to explore and *discover* novel behaviors, by scaling "in-context" interaction!…
Very excited for this one. We took a cautiously experimental view on NN optimizers, aiming to find something that just works. SPlus matches Adam within ~44% of steps on a range of objectives. Please try it out in your setting, or read below for how it works.…
It’s been a really fun time working on this project. It turns out acting longer is the best axis for scaling inference compute for agents, but figuring out the best way to do it is tricky!
🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
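One way to picture the two scaling axes (my framing, with made-up numbers, not from the paper): spend a fixed inference budget either on more thinking tokens per step or on more interaction steps.

```python
# Illustrative comparison (numbers and helper are assumptions): the same total
# token budget can buy deeper per-step reasoning or more steps of interaction
# with the environment.

def run_agent(env, policy, n_steps, tokens_per_step):
    """Gym-style rollout where the policy gets a per-step reasoning budget."""
    obs = env.reset()
    info = {}
    for _ in range(n_steps):
        action = policy(obs, max_new_tokens=tokens_per_step)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return info

TOTAL_BUDGET = 32000                                                        # assumed total token budget
thinking_config = dict(n_steps=8, tokens_per_step=TOTAL_BUDGET // 8)        # think more per step
interaction_config = dict(n_steps=32, tokens_per_step=TOTAL_BUDGET // 32)   # act for more steps
```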
In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it's much better to scale the number of interactions! Check out how we did it!
🧵 1/7 Should AI agents "think more" or "do more"? 🤔 The current trend is to scale test-time compute, making agents generate longer reasoning traces. But what if that’s the wrong approach for interactive tasks? In our new work, we argue for a new scaling dimension: Test-Time…
I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise. Idle musings in my new blog post: sergeylevine.substack.com/p/language-mod…
Looking back, some of the most effective methods that we've built for training LLM/VLM agents in multi-turn settings also *needed* to utilize such a hierarchical structure, e.g., ArCHer (yifeizhou02.github.io/archer.io/) by @YifeiZhou02, further showing the promise behind such ideas.
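For readers who haven't seen ArCHer, the gist of the hierarchy (heavily simplified; the interfaces below are mine, not the ArCHer code): a high-level critic learns values over whole utterances/turns, and the token-level policy inside each turn is trained against that critic's estimates.

```python
# Heavily simplified sketch of a hierarchical update of this flavor: TD learning
# at the utterance level, policy-gradient-style updates at the token level using
# the utterance-level values as the learning signal.

def hierarchical_update(critic, policy, trajectory):
    # High level: temporal-difference backups over turns (utterances).
    for state, utterance, reward, next_state in trajectory:
        critic.td_update(state, utterance, reward, next_state)

    # Low level: train the token-level policy within each turn, using the critic's
    # utterance-level value (minus a baseline) as the advantage for that utterance.
    for state, utterance, _, _ in trajectory:
        advantage = critic.value(state, utterance) - critic.baseline(state)
        policy.reinforce(state, utterance, advantage)
```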
Can offline RL methods do well on any problem, as we scale compute and data? In our new paper led by @seohong_park, we show that task horizon can fundamentally hinder scaling for offline RL, and how explicitly reducing task horizon can address this. arxiv.org/abs/2506.04168 🧵⬇️
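One generic way to see why shrinking the horizon helps (an illustration of the general idea, not the paper's specific recipe): back values up over n-step chunks, so the number of bootstrapped recursions drops from roughly H to H/n.

```python
# Illustration of horizon reduction in general, not the paper's method: an n-step
# target sums n real rewards and bootstraps only once, so value errors compound
# over ~H/n backups instead of H.
# Convention: values has length len(rewards) + 1 (one value per state, incl. terminal).

def n_step_target(rewards, values, t, n, gamma=0.99):
    G = 0.0
    for k in range(n):
        if t + k >= len(rewards):       # episode ended before n steps
            return G
        G += (gamma ** k) * rewards[t + k]
    return G + (gamma ** n) * values[t + n]
```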
Self-Challenging LLM Agents Self-improving AI systems are starting to show up everywhere. Meta and colleagues present self-improvement for general multi-turn tool-use LLM agents. Pay attention to this one, devs! Here are my notes:
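The shape of the loop, as I read the summary (a generic sketch, not the paper's exact algorithm):

```python
# Generic self-improvement loop of the kind described above: the agent proposes
# its own tool-use tasks, attempts them over multiple turns, and keeps only
# trajectories that pass its own check as new training data.

def self_challenge_round(agent, tools, n_tasks=100):
    kept = []
    for _ in range(n_tasks):
        task = agent.propose_task(tools)          # agent invents a tool-use task + success check
        trajectory = agent.attempt(task, tools)   # multi-turn attempt with tool calls
        if task.check(trajectory):                # keep only verified successes
            kept.append(trajectory)
    agent.finetune(kept)                          # train on its own successful attempts
    return len(kept)
```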
Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓