Quanquan Gu
@QuanquanGu
Professor @UCLA, Research Scientist at ByteDance Seed | Recent work: SPIN, SPPO, DPLM, GPM, CryoFM, MARS, TPA, RPG | Opinions are my own
RPG is out. Make KL-regularized policy gradient correct again! No more GRPO or REINFORCE++: their objectives and KL regularization are inherently inconsistent.
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 estimator) and REINFORCE++ under one framework and uncovers RL objectives that outperform GRPO's. Paper: arxiv.org/abs/2505.17508 Code:…
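The k3 estimator mentioned above is the standard low-variance Monte Carlo KL estimator that GRPO adopts: with samples x ~ q and ratio r = p(x)/q(x), k1 = -log r is unbiased but can go negative per-sample, while k3 = (r - 1) - log r is unbiased and always nonnegative. A minimal sketch on two Gaussians (the distribution parameters here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussians: q (the distribution we sample from) and p (the reference).
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 0.5, 1.0

def log_prob(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

x = rng.normal(mu_q, s_q, size=200_000)                   # x ~ q
log_r = log_prob(x, mu_p, s_p) - log_prob(x, mu_q, s_q)   # log p(x)/q(x)
r = np.exp(log_r)

k1 = -log_r              # unbiased estimator of KL(q || p), may be negative per-sample
k3 = (r - 1.0) - log_r   # also unbiased, and >= 0 for every sample (since log r <= r - 1)

# Closed-form KL(q || p) for Gaussians, to check both estimators against.
true_kl = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
print(true_kl, k1.mean(), k3.mean())
```

Both means land on the closed-form value; the per-sample nonnegativity of k3 is what makes it attractive as a plug-in penalty, and the paper's point is that plugging an estimator into the loss is not the same as correctly differentiating the KL-regularized objective.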
Even in the sparsest attention map, you still get all the weight.
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
Everyone get your top 1% quality dataset and train 100 epochs right now
Mistral started it. DeepSeek scaled it. Kimi K2 confirmed it: it's always more convenient to train an MoE.
The model solves these problems without tools like Lean or coding; it just uses natural language, and it has only 4.5 hours. We see the model reason at a very high level: trying out different strategies, making observations from examples, and testing hypotheses.
Wait… this model didn’t even use Lean? That’s insane. Big congrats to the @OpenAI team. That’s incredible work!
We achieved gold medal level performance on this year's IMO! Our model thinks and writes proofs in clear, plain‑English - no formal code required. Unlike the narrower systems used in past competitions, our model is built to reason broadly, far beyond contest problems.
In this economy, you have to choose: feed your model or feed yourself.
What do you think about this billboard idea for Hyperbolic?
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
Drop by our poster in Ballroom A, West Building to check out our cute analysis techniques and the rich set of future directions opened up by our work.
μP plays a central role in scaling large language models, known for hyperparameter transfer & stability. But don’t overlook its feature learning power. 📈
Excited to share our work at #ICML2025! 🚀 We dive into how deep L-layer NNs under μP can learn rich features & guarantee global convergence. w/@TheGregYang , @ZhaoQingyue and @QuanquanGu Check the paper at: arxiv.org/abs/2503.09565 Poster Thursday at 11 am! 👇 [1/4]
Mixture of Raytraced Experts Stacked MoE architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Allows predictions with increasing accuracy as the computation cycles through the experts' sequence. Links below
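The core idea in that description is applying experts as a *sequence* rather than in parallel, with routing decisions made per cycle so compute depth can vary at inference time. A toy numpy sketch of that idea (this is my guess at the shape of such a model for illustration, not the paper's architecture; expert count, dimensions, and the residual update are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS = 8, 4

# Hypothetical setup: linear experts plus a linear router. Each cycle, the
# router picks one expert, and its output refines the running representation,
# so stopping after more cycles spends more compute on the same input.
experts = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_EXPERTS)]
router = rng.normal(size=(DIM, N_EXPERTS))

def forward(x, n_cycles):
    for _ in range(n_cycles):
        idx = int(np.argmax(x @ router))  # route to one expert this cycle
        x = x + experts[idx] @ x          # residual refinement of the state
    return x

x0 = rng.normal(size=DIM)
shallow, deep = forward(x0, 1), forward(x0, 4)  # variable-depth compute
print(shallow.shape, deep.shape)
```

Because the expert chosen at cycle t depends on the state left by cycle t-1, different inputs trace different expert sequences, i.e. computational graphs of variable width and depth.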
Excited to give a tutorial with @leenaCvankadara on Training Neural Networks at Any Scale (TRAINS) @icmlconf at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS Please join us.
📉Learning rate decay is super effective and sometimes mysterious, but the simplest model of SGD on quadratic loss w/ noisy gradients almost perfectly predicts loss curves of transformers trained with Adam on real data, across schedules, model sizes, and token budgets. 1/4
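The "simplest model" in that claim can be traced exactly in a few lines: for SGD on a 1-D quadratic f(w) = w²/2 with additive gradient noise, E[w²] obeys a closed-form recursion, so the expected loss curve under any schedule needs no sampling at all. A sketch (step sizes and noise level are illustrative choices, not the paper's):

```python
import numpy as np

def expected_loss(schedule, w0_sq=1.0, noise_var=1.0):
    """Exact expected loss for SGD on f(w) = w^2/2 with gradient w + xi,
    xi ~ N(0, noise_var): E[w_{t+1}^2] = (1 - eta_t)^2 E[w_t^2] + eta_t^2 * noise_var."""
    e, curve = w0_sq, []
    for eta in schedule:
        e = (1 - eta) ** 2 * e + eta ** 2 * noise_var
        curve.append(0.5 * e)  # loss is w^2 / 2
    return np.array(curve)

T = 1000
const = np.full(T, 0.1)                 # constant learning rate
decay = 0.1 * (1 - np.arange(T) / T)    # linear decay to zero

loss_const = expected_loss(const)
loss_decay = expected_loss(decay)
print(loss_const[-1], loss_decay[-1])
```

The constant schedule plateaus at the noise floor eta² · sigma² / (1 - (1 - eta)²), while the decaying schedule drops below it near the end, the characteristic shape that the thread says matches real transformer loss curves.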
Moving from theory to large models hasn't been a popular journey. Some people can't get used to your change; some don't want to see you actually succeed. Losers and haters make noise. Builders build. Feel the AGI!
Can’t make it to #ICML2025 this year. People ask why I’m so obsessed with pretraining and scaling. Simple: the AGI era is here. I refuse to be irrelevant.
Does a better pretraining loss result in better performance on downstream tasks? Do downstream scaling laws exist? What kind of relationship exists between pretraining loss and performance on downstream tasks? This latest paper from NYU studies the reliability of downstream…