Quanquan Gu
@QuanquanGu
Professor @UCLA, Research Scientist at ByteDance Seed | Recent work: SPIN, SPPO, DPLM, GPM, CryoFM, MARS, TPA, RPG | Opinions are my own
RPG is out. Make KL-regularized policy gradient correct again! No more GRPO or REINFORCE++: their objectives and KL regularization are inherently inconsistent.
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 estimator) and REINFORCE++ under one framework and uncovers RL objectives that outperform GRPO's. Paper: arxiv.org/abs/2505.17508 Code:…
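The k3 estimator mentioned above is the standard low-variance Monte Carlo KL estimator that GRPO adopts: with samples x ~ q and ratio r = p(x)/q(x), k1 = -log r is unbiased but can go negative per-sample, while k3 = (r - 1) - log r is unbiased and always nonnegative. A minimal sketch on two Gaussians (the distribution parameters here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussians: q (the distribution we sample from) and p (the reference).
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 0.5, 1.0

def log_prob(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

x = rng.normal(mu_q, s_q, size=200_000)                   # x ~ q
log_r = log_prob(x, mu_p, s_p) - log_prob(x, mu_q, s_q)   # log p(x)/q(x)
r = np.exp(log_r)

k1 = -log_r              # unbiased estimator of KL(q || p), may be negative per-sample
k3 = (r - 1.0) - log_r   # also unbiased, and >= 0 for every sample (since log r <= r - 1)

# Closed-form KL(q || p) for Gaussians, to check both estimators against.
true_kl = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
print(true_kl, k1.mean(), k3.mean())
```

Both means land on the closed-form value; the per-sample nonnegativity of k3 is what makes it attractive as a plug-in penalty, and the paper's point is that plugging an estimator into the loss is not the same as correctly differentiating the KL-regularized objective.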
Even in the sparsest attention map, you still get all the weight.
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
Everyone get your top 1% quality dataset and train 100 epochs right now
Mistral started it. DeepSeek scaled it. Kimi K2 confirmed it: it's always more convenient to train an MoE.
The model solves these problems without tools like Lean or coding; it just uses natural language, and it has only 4.5 hours. We see the model reason at a very high level: trying out different strategies, making observations from examples, and testing hypotheses.
Wait… this model didn’t even use Lean? That’s insane. Big congrats to the @OpenAI team. That’s incredible work!
We achieved gold medal level performance on this year's IMO! Our model thinks and writes proofs in clear, plain‑English - no formal code required. Unlike the narrower systems used in past competitions, our model is built to reason broadly, far beyond contest problems.
In this economy, you have to choose: feed your model or feed yourself.
What do you think about this billboard idea for Hyperbolic?
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
Drop by our poster in Ballroom A, West Building to check out our cute analysis techniques and the rich set of future directions opened up by our work.
μP plays a central role in scaling large language models, known for hyperparameter transfer & stability. But don’t overlook its feature learning power. 📈
Excited to share our work at #ICML2025! 🚀 We dive into how deep L-layer NNs under μP can learn rich features & guarantee global convergence. w/@TheGregYang , @ZhaoQingyue and @QuanquanGu Check the paper at: arxiv.org/abs/2503.09565 Poster Thursday at 11 am! 👇 [1/4]
Mixture of Raytraced Experts Stacked MoE architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Allows predictions with increasing accuracy as the computation cycles through the experts' sequence. Links below
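The core idea in that description is applying experts as a *sequence* rather than in parallel, with routing decisions made per cycle so compute depth can vary at inference time. A toy numpy sketch of that idea (this is my guess at the shape of such a model for illustration, not the paper's architecture; expert count, dimensions, and the residual update are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS = 8, 4

# Hypothetical setup: linear experts plus a linear router. Each cycle, the
# router picks one expert, and its output refines the running representation,
# so stopping after more cycles spends more compute on the same input.
experts = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_EXPERTS)]
router = rng.normal(size=(DIM, N_EXPERTS))

def forward(x, n_cycles):
    for _ in range(n_cycles):
        idx = int(np.argmax(x @ router))  # route to one expert this cycle
        x = x + experts[idx] @ x          # residual refinement of the state
    return x

x0 = rng.normal(size=DIM)
shallow, deep = forward(x0, 1), forward(x0, 4)  # variable-depth compute
print(shallow.shape, deep.shape)
```

Because the expert chosen at cycle t depends on the state left by cycle t-1, different inputs trace different expert sequences, i.e. computational graphs of variable width and depth.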
Excited to give a tutorial with @leenaCvankadara on Training Neural Networks at Any Scale (TRAINS) @icmlconf at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS Please join us.
📉Learning rate decay is super effective and sometimes mysterious, but the simplest model of SGD on quadratic loss w/ noisy gradients almost perfectly predicts loss curves of transformers trained with Adam on real data, across schedules, model sizes, and token budgets. 1/4
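The "simplest model" in that claim can be traced exactly in a few lines: for SGD on a 1-D quadratic f(w) = w²/2 with additive gradient noise, E[w²] obeys a closed-form recursion, so the expected loss curve under any schedule needs no sampling at all. A sketch (step sizes and noise level are illustrative choices, not the paper's):

```python
import numpy as np

def expected_loss(schedule, w0_sq=1.0, noise_var=1.0):
    """Exact expected loss for SGD on f(w) = w^2/2 with gradient w + xi,
    xi ~ N(0, noise_var): E[w_{t+1}^2] = (1 - eta_t)^2 E[w_t^2] + eta_t^2 * noise_var."""
    e, curve = w0_sq, []
    for eta in schedule:
        e = (1 - eta) ** 2 * e + eta ** 2 * noise_var
        curve.append(0.5 * e)  # loss is w^2 / 2
    return np.array(curve)

T = 1000
const = np.full(T, 0.1)                 # constant learning rate
decay = 0.1 * (1 - np.arange(T) / T)    # linear decay to zero

loss_const = expected_loss(const)
loss_decay = expected_loss(decay)
print(loss_const[-1], loss_decay[-1])
```

The constant schedule plateaus at the noise floor eta² · sigma² / (1 - (1 - eta)²), while the decaying schedule drops below it near the end, the characteristic shape that the thread says matches real transformer loss curves.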
Moving from theory to large models hasn't been a popular journey. Some people can't get used to your change; some don't want to see you actually succeed. Losers and haters make noise. Builders build. Feel the AGI!
Can’t make it to #ICML2025 this year. People ask why I’m so obsessed with pretraining and scaling. Simple: the AGI era is here. I refuse to be irrelevant.
Does a better pretraining loss result in better performance on downstream tasks? Do downstream scaling laws exist? What kind of relationship exists between pretraining loss and performance on downstream tasks? This latest paper from NYU studies the reliability of downstream…