Zichen Liu
@zzlccc
PhD student, RL believer @SeaAIL @NUSingapore | 💻 📄 🏸
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full…
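For readers skimming the thread, here is a minimal toy sketch of the length-bias point, under my reading of the paper: GRPO normalizes the group-relative advantage by the reward std and averages the per-token loss within each response (1/|o_i|), and the Dr. GRPO fix drops both normalizations. The function names and toy numbers below are illustrative, not the paper's code.

```python
# Toy illustration only (my reading of the thread; not the paper's code).
import numpy as np

def grpo_style_advantages(rewards):
    # group-relative advantage with std normalization (GRPO-style)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def response_loss(advantage, num_tokens, normalize_by_length):
    # pretend every token contributes -advantage to the policy-gradient loss
    token_terms = -advantage * np.ones(num_tokens)
    return token_terms.mean() if normalize_by_length else token_terms.sum()

rewards = np.array([1.0, 0.0, 0.0])   # one correct, two incorrect responses
lengths = [50, 200, 800]              # the incorrect ones differ in length

for adv, n in zip(grpo_style_advantages(rewards), lengths):
    # with 1/|o_i| averaging, the 800-token wrong answer contributes no larger
    # a loss than the 200-token one; without it, the penalty grows with length
    print(n, response_loss(adv, n, normalize_by_length=True),
             response_loss(adv, n, normalize_by_length=False))
```

With the per-response average, a long wrong answer is penalized no more than a short one, which is the kind of pressure toward ever-longer outputs the thread points at; dropping the normalization removes that asymmetry.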

I would say it's not a bug but a technique used by prior RL researchers... We can even treat the number of timesteps included in the IS computation as a hyper-parameter to get the best bias-variance tradeoff x.com/zzlccc/status/…
Learning GSPO proposed by the Qwen team:
fig 1. they propose to use sequence likelihood for importance sampling
fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG
fig 3. per-token IS in (Dr) GRPO is an approximation of it
Am I missing anything?
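To unpack the three figures in plain code: a rough sketch under my reading of the thread (the function names and toy numbers are mine, not from the GSPO paper). The sequence-level IS ratio is just the product of the per-token ratios (GSPO additionally length-normalizes it, which this sketch omits), and truncating how many timesteps enter the product is the bias-variance knob mentioned above.

```python
# Rough sketch (my reading of the thread; names and numbers are illustrative).
import numpy as np

def per_token_ratios(logp_new, logp_old):
    # r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), one ratio per token
    return np.exp(logp_new - logp_old)

def sequence_ratio(logp_new, logp_old):
    # r_seq = pi_new(y | x) / pi_old(y | x) = prod_t r_t
    return np.exp(np.sum(logp_new - logp_old))

def truncated_sequence_ratio(logp_new, logp_old, k):
    # the "number of timesteps included in IS" knob from the reply above:
    # multiply only the first k token ratios (k trades bias for variance)
    return np.exp(np.sum((logp_new - logp_old)[:k]))

# toy per-token log-probs for one sampled response
logp_old = np.log(np.array([0.50, 0.40, 0.30]))
logp_new = np.log(np.array([0.55, 0.35, 0.33]))

print(per_token_ratios(logp_new, logp_old))                # [1.1, 0.875, 1.1]
print(sequence_ratio(logp_new, logp_old))                  # ~1.059, product of the above
print(truncated_sequence_ratio(logp_new, logp_old, k=2))   # ~0.963, first two tokens only
```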



eat oat every morning; train with oat🌾 the rest of the day: github.com/sail-sg/oat

no wonder Gemini used natural language-only solutions to achieve IMO gold

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable…
🆕 Releasing our entire RL + Reasoning track! featuring: • @willccbb, Prime Intellect • @GregKamradt, Arc Prize • @natolambert, AI2/Interconnects • @corbtt, OpenPipe • @achowdhery, Reflection • @ryanmart3n, Bespoke • @ChrSzegedy, Morph with special 3 hour workshop from:…
100%, we should optimize for long-term returns instead of immediate reward, aka gamma > 0
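To make the gamma point concrete, a tiny illustrative sketch (not tied to any particular RL library): with gamma = 0 the return collapses to the immediate reward, while with gamma close to 1 a delayed payoff can outweigh a short-term penalty.

```python
# Illustrative only: discounted return G_t = sum_k gamma^k * r_{t+k}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1.0, 0.0, 0.0, 10.0]               # short-term penalty, long-term payoff
print(discounted_return(rewards, gamma=0.0))   # -1.0: only the immediate reward counts
print(discounted_return(rewards, gamma=0.99))  # ~8.7: the delayed payoff dominates
```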
Doing the right thing is important, even if you get penalized for it.
If life is a trajectory, and we are doing online continual RL, it seems that we are driven by many verifiable rewards 🤔
New blog post about asymmetry of verification and "verifier's law": jasonwei.net/blog/asymmetry… Asymmetry of verification, the idea that some tasks are much easier to verify than to solve, is becoming increasingly important now that we have RL that finally works generally. Great examples of…
Proud of my ML systems colleagues, who are doing amazing work with a long-term vision
This is part of our Zero Bubble Pipeline Parallelism series, pushing the throughput🚀, memory💾, and scalability📈 of LLM training to the extreme! Project link: github.com/sail-sg/zero-b…
Honored our paper was selected as Best Paper Runner-Up at #ICML @ai4mathworkshop! Grateful to my incredible collaborators who will be presenting - wish I could join in person. Big thanks to the committee!
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full…
🚀 New at #ICML25: PipeOffload breaks the memory wall for LLM training with Pipeline Parallelism — linear activation memory scaling with zero performance loss. High memory usage is no longer a blocker. 👉 Follow our series of work pushing limits of PP!
Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers:
1. 🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a. Dr. GRPO, but I think the paper is more than this loss fix)
2. Main…



wow
This is the new Kimi K2 Instruct model running on claude-code. I had it make a 3D Snake game with a backend and a frontend. It made this in about 2 hours of coding and headless testing and cost about $1.50 of API usage:
- 3D snake game in three.js
- Flask server, with a SQLite…
This work has been accepted at @COLM_conf 2025. See you in Canada! 🍁
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full…
I can literally Ctrl+F those sentences but can't see them with my eyes. Sad for them.
"in 2025 we will have flying cars" 😂😂😂
LLM + RL + Self-Play + Game = ♾ Infinite Possibilities ♾ 💥New Paper Alert💥 🔗Paper: huggingface.co/papers/2506.24… 🔗Code: github.com/spiral-rl/spir…
The future of RL+LLM? Self-play. Why? Competitive scenarios offer:
✅ Built-in verification
✅ Automated curriculum learning
✅ Infinite complexity scaling
Games prove this works for multi-turn, multi-agent systems. But the real potential? Extending beyond games to real-world…
Self-play on zero-sum language games creates selection pressure for LLMs to develop transferable reasoning patterns. Enjoyed building the multi-agent, multi-turn RL system and training agents that think strategically through self-play! Paper: huggingface.co/papers/2506.24… Code:…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
In our latest paper, we discovered a surprising result: training LLMs with self-play reinforcement learning on zero-sum games (like poker) significantly improves performance on math and reasoning benchmarks, zero-shot. Whaaat? How does this work? We analyze the results and find…
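A heavily simplified sketch of the self-play recipe these posts describe, as I read it: the Nim-like game, the random stand-in policy, and the update stub below are placeholders, not the SPIRAL codebase.

```python
# Heavily simplified self-play sketch (placeholders, not the actual SPIRAL code):
# one policy plays both seats of a zero-sum game, the game outcome supplies a
# cheap verifiable reward, and both seats' rollouts feed the same policy update.
import random

def legal_moves(sticks):
    return [k for k in (1, 2, 3) if k <= sticks]

def policy_act(sticks):
    # stand-in for the LLM policy; in the real system this is a sampled CoT + move
    return random.choice(legal_moves(sticks))

def play_self_play_game(num_sticks=10):
    trajectories = {0: [], 1: []}
    player, sticks = 0, num_sticks
    while sticks > 0:
        move = policy_act(sticks)
        trajectories[player].append((sticks, move))
        sticks -= move
        if sticks == 0:
            winner = player                 # verifiable outcome, no learned reward model
        player = 1 - player
    rewards = {p: 1.0 if p == winner else -1.0 for p in (0, 1)}  # zero-sum
    return trajectories, rewards

def policy_update(batch):
    # stand-in for the RL step (e.g. a GRPO-style update in the real system)
    wins = sum(1 for _, r in batch if r > 0)
    print(f"update on {len(batch)} rollouts, {wins} winning")

batch = []
for _ in range(4):                          # a few self-play games per iteration
    trajs, rewards = play_self_play_game()
    for p in (0, 1):
        batch.append((trajs[p], rewards[p]))
policy_update(batch)
```

The point of the toy: the reward comes straight from the game's win condition (verifiable for free), and both seats' rollouts update the same policy, so the opponent improves alongside the learner, which is where the automatic curriculum comes from.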