Lifan Yuan
@lifan__yuan
PhD student @UofIllinois @uiuc_nlp @GoogleDeepMind. Prev: @TsinghuaNLP
How to unlock advanced reasoning via scalable RL? 🚀Introducing PRIME (Process Reinforcement through Implicit Rewards) and Eurus-2, trained from Base model to surpass Qwen2.5-Math-Instruct using only 1/10 of the data. We're still scaling up - w/ 3x more training data to go! 🧵
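A minimal sketch of the implicit process reward idea behind PRIME, assuming the formulation where per-token rewards come from the log-likelihood ratio between the outcome-trained model and a frozen reference, r_t = β·log(π_φ(y_t|y_<t)/π_ref(y_t|y_<t)). All numbers below (β, the log-probs) are illustrative, not from the paper:

```python
import numpy as np

beta = 0.05  # KL coefficient (illustrative value, not the paper's setting)

# Per-token log-probs of the same sampled response under the implicit PRM
# (an outcome-trained LM) and a frozen reference model. Synthetic numbers.
logp_prm = np.array([-1.2, -0.4, -2.1, -0.3])
logp_ref = np.array([-1.5, -0.5, -1.8, -0.9])

# Implicit process reward per token: beta * log-likelihood ratio.
# No process labels are needed; the token-level signal falls out for free.
r_tok = beta * (logp_prm - logp_ref)

# Summing token-level rewards recovers a sequence-level (outcome) reward.
r_outcome = r_tok.sum()
```

The point of the construction is that dense per-step rewards are obtained from a model trained only on outcome labels.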


We've updated an analysis of format correction in 1-shot RLVR, a frequently asked question (thx for all the feedback!) Summary: (1) Format correction indeed contributes a lot in RLVR (e.g., 18% -> 29% for Qwen2.5-Math-1.5B over 6 math tasks), in both full-set (1.2k data) and one-shot…
With Grok-4, RL is the new pre-training
We built 200k-GPU clusters; We scaled up & curated higher-quality data; We scaled compute by 100x; We developed training & test-time recipes; We made everything RL native; We stabilized infrastructure and sped things up; That's how you bring RL to pre-training scale. Yet I am…
🚨 Deadline for SCALR 2025 Workshop: Test‑time Scaling & Reasoning Models at COLM '25 @COLM_conf is approaching!🚨 scalr-workshop.github.io 🧩 Call for short papers (4 pages, non‑archival) now open on OpenReview! Submit by June 23, 2025; notifications out July 24. Topics…
🚀 I'm looking for full-time research scientist jobs on foundation models! I study pre-training and post-training of foundation models, and LLM-based coding agents. The figure highlights my research/publications. Please DM me if there is any good fit! Highly appreciated!
It shares a similar spirit with scaling laws: we can use early runs to fit a and b, then predict the final perf (at H=0, R = -a+b). Also, it's base models that determine the ceiling, not algos; the algos have the same efficiency in consuming entropy, as indicated by the similar a.
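A minimal sketch of the fit described above, assuming the entropy-performance relation has the form R = -a·e^H + b (so that H=0 gives the ceiling R = -a+b, matching the tweet). Since R is linear in e^H, ordinary least squares suffices; the (H, R) points below are synthetic illustrations, not real measurements:

```python
import numpy as np

# Synthetic early-run checkpoints: policy entropy H and downstream reward R.
H = np.array([1.2, 1.0, 0.8, 0.6, 0.5])
R = np.array([0.30, 0.38, 0.45, 0.52, 0.55])

# R = -a * exp(H) + b is linear in exp(H), so fit (a, b) by least squares.
X = np.stack([-np.exp(H), np.ones_like(H)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, R, rcond=None)

# Predicted performance ceiling once entropy is fully consumed (H = 0).
R_ceiling = -a + b
```

Comparing the fitted a across algorithms (similar a means similar efficiency in consuming entropy) and the fitted b across base models (which set the ceiling) is what the argument above relies on.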
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
Can entropy minimization alone improve LLM performance? And how far can it go without any labeled data? This work answers both: yes, and surprisingly far 🐮 At inference, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update