Runa Eschenhagen
@runame_
PhD student in machine learning @CambridgeMLG and research scientist intern @AIatMeta.
1/7 Still using Adam? If anyone wants to try a distributed PyTorch implementation of SOAP/eigenvalue-corrected Shampoo with support for low-precision data types instead, here you go. github.com/facebookresear…
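For anyone new to SOAP/eigenvalue-corrected Shampoo, here is a rough single-matrix sketch of the idea. This is not the linked repo's API; the function name and hyperparameters are made up for illustration, and real implementations handle bias correction, weight decay, sharding, and non-2D parameters.

```python
# Illustrative sketch only -- not the API of the linked repo. It shows the core
# SOAP / eigenvalue-corrected Shampoo idea for a single 2D parameter:
# accumulate the Shampoo Kronecker factors L = EMA[G G^T], R = EMA[G^T G],
# then run an Adam-style update in their eigenbasis.
import torch

def soap_like_step(p, grad, state, lr=3e-4, betas=(0.9, 0.95),
                   eps=1e-8, basis_update_freq=10):
    """One hypothetical SOAP-style step for a matrix-shaped parameter `p`."""
    b1, b2 = betas
    if not state:                                   # lazy state init
        state.update(
            L=torch.zeros(p.shape[0], p.shape[0]),
            R=torch.zeros(p.shape[1], p.shape[1]),
            QL=torch.eye(p.shape[0]),
            QR=torch.eye(p.shape[1]),
            m=torch.zeros_like(p), v=torch.zeros_like(p), t=0,
        )
    state["t"] += 1

    # Shampoo statistics (Kronecker factors of the preconditioner).
    state["L"].mul_(b2).add_(grad @ grad.T, alpha=1 - b2)
    state["R"].mul_(b2).add_(grad.T @ grad, alpha=1 - b2)

    # Refresh the eigenbases only occasionally; the eigendecompositions are
    # the expensive part, so real implementations amortize them.
    if state["t"] % basis_update_freq == 1:
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors

    # "Eigenvalue correction": instead of using the factor eigenvalues,
    # track Adam-style second moments per coordinate of the rotated gradient.
    g_rot = state["QL"].T @ grad @ state["QR"]
    state["m"].mul_(b1).add_(g_rot, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(g_rot, g_rot, value=1 - b2)
    update = state["m"] / (state["v"].sqrt() + eps)  # bias correction omitted

    # Rotate the update back to parameter space and apply it.
    p.add_(state["QL"] @ update @ state["QR"].T, alpha=-lr)
```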
This past spring, I spent time with the @exolabs team working on a new DL optimizer and wiring up clusters of Macs for distributed training on Apple Silicon. If you’re at ICML, be sure to come by the @ESFoMo workshop (posters 1-2:30pm) this Saturday. I’ll be there to share some…
I’m going to be in Vancouver next week for ICML! Would love to meet anyone involved with distributed training, infrastructure, inference engines, or open-source AI. I'll be presenting two papers: - EXO Gym, an open-source framework for simulating distributed training algorithms…
At #ICML2025 and don't know which workshop to join? Why not come and celebrate/rant about open-source ML with us? We've got amazing speakers (@tri_dao is just one example)! Come by West Meeting Room 211-214 👋
Excited to share our ICML 2025 paper: "Scalable Gaussian Processes with Latent Kronecker Structure" We unlock efficient linear algebra for your kernel matrix which *almost* has Kronecker product structure. Check out our paper here: arxiv.org/abs/2506.06895
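As a toy illustration of why Kronecker structure matters (my own example, not from the paper): for K = A ⊗ B you never need to materialize K, because (A ⊗ B) vec(X) = vec(B X Aᵀ).

```python
# Toy example (not from the paper): matrix-vector products with a Kronecker-
# structured kernel K = A ⊗ B, without ever materializing K.
import torch

n, m = 50, 40
A = torch.randn(n, n, dtype=torch.float64)
A = A @ A.T + n * torch.eye(n, dtype=torch.float64)     # SPD kernel factor
B = torch.randn(m, m, dtype=torch.float64)
B = B @ B.T + m * torch.eye(m, dtype=torch.float64)     # SPD kernel factor
X = torch.randn(m, n, dtype=torch.float64)              # vec(X) has length n * m

vec = lambda M: M.T.reshape(-1)                         # column-major vectorization

naive = torch.kron(A, B) @ vec(X)                       # O((nm)^2) memory and time
fast = vec(B @ X @ A.T)                                 # O(nm(n + m)), no big matrix
print(torch.allclose(naive, fast))                      # True
```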
I’ll be presenting our paper “On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning” at ICML during the Tuesday 11am poster session! DL opt is seeing a renaissance 🦾; what can we say from an NN feature learning perspective? 1/8
You don't need bespoke tools for causal inference. Probabilistic modelling is enough. I'll be making this case (and dodging pitchforks) at our ICML oral presentation tomorrow.
When comparing optimization methods, we often change *multiple things at once*—geometry, normalization, etc.—possibly without realizing it. Let's disentangle these changes. 👇
📢 [Openings] I'm now an Assistant Prof @WesternU CS dept. Funded PhD & MSc positions available! Topics: large probabilistic models, decision-making under uncertainty, and apps in AI4Science. More on agustinus.kristia.de/openings/
My former PhD student Fred Kunstner has been awarded the @c_a_i_a_c Best Doctoral Dissertation Award: cs.ubc.ca/news/2025/06/f… His thesis on machine learning algorithms includes an EM proof "from the book", why Adam works, and the first provably faster hyper-gradient method.
Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training. arxiv.org/abs/2506.02285
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.…
1. We often observe power laws between loss and compute: loss = a * flops ^ b + c 2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
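A quick back-of-the-envelope way to see the difference (numbers below are made up for illustration): invert the power law to get the compute needed for a target loss, then perturb a vs b.

```python
# Made-up numbers, just to illustrate constant vs exponent improvements in
# loss = a * flops**b + c. Inverting the fit gives the compute needed to
# reach a target loss.
def flops_to_reach(target_loss, a, b, c):
    return ((target_loss - c) / a) ** (1.0 / b)

a, b, c = 10.0, -0.1, 1.5                           # hypothetical baseline fit
target = 2.0

base = flops_to_reach(target, a, b, c)
half_a = flops_to_reach(target, a / 2, b, c)        # constant improved 2x
steeper_b = flops_to_reach(target, a, 1.1 * b, c)   # exponent improved 10%

# Improving `a` buys a fixed compute multiplier (here 2**(1/b), i.e. ~1/1024);
# improving `b` buys a saving that keeps growing as the target loss drops.
print(f"half a:    {half_a / base:.2e}x baseline compute")
print(f"steeper b: {steeper_b / base:.2e}x baseline compute")
```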
This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the impact of years of careful theoretical research. It starts with Greg Yang (@TheGregYang) opening up research on muP scaling and…
(1/7) @CerebrasSystems Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
Great to be back from #ICLR2025 in Singapore, and super excited to have given my first oral presentation on influence functions for diffusion models!
We have a fantastic lineup of speakers who have made deep contributions to open-source in ML, e.g. @sarahookr, @ChrisRackauckas, @SingularMattrix, @tri_dao, @BlancheMinerva, and Evan Shelhamer!
Built a new ML library? Maintaining a crucial project? Improved OSS practices? Your work deserves recognition! Submit your contributions to the CODEML workshop @ #ICML2025. We're championing open-source in ML. 💻✨ Deadline May 19. codeml-workshop.github.io/codeml2025/
Why does Adam outperform SGD in LLM training? Adaptive step sizes alone don't fully explain this, as Adam also surpasses adaptive SGD. Is coordinate-wise adaptivity the secret? Not entirely: Adam actually struggles in the rotated parameter space! 🧵 (1/6) arxiv.org/abs/2410.08198
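A minimal sketch of the kind of rotation experiment the abstract points at (my own toy version, not the paper's code): run Adam on φ = Rθ for a fixed random orthogonal R, so coordinate-wise adaptivity no longer aligns with the original parameter axes.

```python
# Toy sketch of a "rotated Adam" experiment (my reading of the abstract, not
# the paper's code): apply Adam in a fixed, randomly rotated parameter basis.
import torch

def random_rotation(d):
    """Fixed orthogonal matrix R from the QR decomposition of a Gaussian."""
    Q, _ = torch.linalg.qr(torch.randn(d, d))
    return Q

def rotated_adam_step(theta, grad, R, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Adam on phi = R @ theta: rotate the gradient, do the usual
    coordinate-wise update, then rotate the step back to the original basis."""
    b1, b2 = betas
    g = R @ grad                                    # gradient w.r.t. phi
    if not state:
        state.update(m=torch.zeros_like(g), v=torch.zeros_like(g), t=0)
    state["t"] += 1
    state["m"].mul_(b1).add_(g, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat = state["m"] / (1 - b1 ** state["t"])     # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])     # bias-corrected second moment
    theta.add_(R.T @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)
```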