Runa Eschenhagen
@runame_
PhD student in machine learning @CambridgeMLG and research scientist intern @AIatMeta.
1/7 Still using Adam? If anyone wants to try a distributed PyTorch implementation of SOAP/eigenvalue-corrected Shampoo with support for low-precision data types instead, here you go. github.com/facebookresear…
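For anyone new to SOAP/eigenvalue-corrected Shampoo, here is a rough single-matrix sketch of the idea. This is not the linked repo's API; the function name and hyperparameters are made up for illustration, and real implementations handle bias correction, weight decay, sharding, and non-2D parameters.

```python
# Illustrative sketch only -- not the API of the linked repo. It shows the core
# SOAP / eigenvalue-corrected Shampoo idea for a single 2D parameter:
# accumulate the Shampoo Kronecker factors L = EMA[G G^T], R = EMA[G^T G],
# then run an Adam-style update in their eigenbasis.
import torch

def soap_like_step(p, grad, state, lr=3e-4, betas=(0.9, 0.95),
                   eps=1e-8, basis_update_freq=10):
    """One hypothetical SOAP-style step for a matrix-shaped parameter `p`."""
    b1, b2 = betas
    if not state:                                   # lazy state init
        state.update(
            L=torch.zeros(p.shape[0], p.shape[0]),
            R=torch.zeros(p.shape[1], p.shape[1]),
            QL=torch.eye(p.shape[0]),
            QR=torch.eye(p.shape[1]),
            m=torch.zeros_like(p), v=torch.zeros_like(p), t=0,
        )
    state["t"] += 1

    # Shampoo statistics (Kronecker factors of the preconditioner).
    state["L"].mul_(b2).add_(grad @ grad.T, alpha=1 - b2)
    state["R"].mul_(b2).add_(grad.T @ grad, alpha=1 - b2)

    # Refresh the eigenbases only occasionally; the eigendecompositions are
    # the expensive part, so real implementations amortize them.
    if state["t"] % basis_update_freq == 1:
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors

    # "Eigenvalue correction": instead of using the factor eigenvalues,
    # track Adam-style second moments per coordinate of the rotated gradient.
    g_rot = state["QL"].T @ grad @ state["QR"]
    state["m"].mul_(b1).add_(g_rot, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(g_rot, g_rot, value=1 - b2)
    update = state["m"] / (state["v"].sqrt() + eps)  # bias correction omitted

    # Rotate the update back to parameter space and apply it.
    p.add_(state["QL"] @ update @ state["QR"].T, alpha=-lr)
```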
This past spring, I spent time with the @exolabs team working on a new DL optimizer and wiring up clusters of Macs for distributed training on Apple Silicon. If you’re at ICML, be sure to come by the @ESFoMo workshop (posters 1-2:30pm) this Saturday. I’ll be there to share some…
I’m going to be in Vancouver next week for ICML! Would love to meet anyone involved with distributed training, infrastructure, inference engines, or open-source AI. I'll be presenting two papers: - EXO Gym, an open-source framework for simulating distributed training algorithms…
At #ICML2025 and don't know which workshop to join? Why not come and celebrate/rant about open-source ML with us? We've got amazing speakers (@tri_dao is just one example)! Come by West Meeting Room 211-214 👋
Excited to share our ICML 2025 paper: "Scalable Gaussian Processes with Latent Kronecker Structure" We unlock efficient linear algebra for your kernel matrix which *almost* has Kronecker product structure. Check out our paper here: arxiv.org/abs/2506.06895
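As a toy illustration of why Kronecker structure matters (my own example, not from the paper): for K = A ⊗ B you never need to materialize K, because (A ⊗ B) vec(X) = vec(B X Aᵀ).

```python
# Toy example (not from the paper): matrix-vector products with a Kronecker-
# structured kernel K = A ⊗ B, without ever materializing K.
import torch

n, m = 50, 40
A = torch.randn(n, n, dtype=torch.float64)
A = A @ A.T + n * torch.eye(n, dtype=torch.float64)     # SPD kernel factor
B = torch.randn(m, m, dtype=torch.float64)
B = B @ B.T + m * torch.eye(m, dtype=torch.float64)     # SPD kernel factor
X = torch.randn(m, n, dtype=torch.float64)              # vec(X) has length n * m

vec = lambda M: M.T.reshape(-1)                         # column-major vectorization

naive = torch.kron(A, B) @ vec(X)                       # O((nm)^2) memory and time
fast = vec(B @ X @ A.T)                                 # O(nm(n + m)), no big matrix
print(torch.allclose(naive, fast))                      # True
```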
I’ll be presenting our paper “On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning” at ICML during the Tuesday 11am poster session! DL opt is seeing a renaissance 🦾; what can we say from an NN feature learning perspective? 1/8
You don't need bespoke tools for causal inference. Probabilistic modelling is enough. I'll be making this case (and dodging pitchforks) at our ICML oral presentation tomorrow.
When comparing optimization methods, we often change *multiple things at once*—geometry, normalization, etc.—possibly without realizing it. Let's disentangle these changes. 👇
📢 [Openings] I'm now an Assistant Prof @WesternU CS dept. Funded PhD & MSc positions available! Topics: large probabilistic models, decision-making under uncertainty, and apps in AI4Science. More on agustinus.kristia.de/openings/
My former PhD student Fred Kunstner has been awarded the @c_a_i_a_c Best Doctoral Dissertation Award: cs.ubc.ca/news/2025/06/f… His thesis on machine learning algorithms includes an EM proof "from the book", why Adam works, and the first provably faster hyper-gradient method.
Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training. arxiv.org/abs/2506.02285
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.…
1. We often observe power laws between loss and compute: loss = a * flops ^ b + c 2. Models are rapidly becoming more efficient, i.e. use less compute to reach the same loss But: which innovations actually change the exponent in the power law (b) vs change only the constant (a)?
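A quick back-of-the-envelope way to see the difference (numbers below are made up for illustration): invert the power law to get the compute needed for a target loss, then perturb a vs b.

```python
# Made-up numbers, just to illustrate constant vs exponent improvements in
# loss = a * flops**b + c. Inverting the fit gives the compute needed to
# reach a target loss.
def flops_to_reach(target_loss, a, b, c):
    return ((target_loss - c) / a) ** (1.0 / b)

a, b, c = 10.0, -0.1, 1.5                           # hypothetical baseline fit
target = 2.0

base = flops_to_reach(target, a, b, c)
half_a = flops_to_reach(target, a / 2, b, c)        # constant improved 2x
steeper_b = flops_to_reach(target, a, 1.1 * b, c)   # exponent improved 10%

# Improving `a` buys a fixed compute multiplier (here 2**(1/b), i.e. ~1/1024);
# improving `b` buys a saving that keeps growing as the target loss drops.
print(f"half a:    {half_a / base:.2e}x baseline compute")
print(f"steeper b: {steeper_b / base:.2e}x baseline compute")
```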
This is a huge development. I want to highlight the theoreticians behind the scenes, because this paper represents the impact of years of careful theoretical research. It starts with Greg Yang (@TheGregYang) opening up research on muP scaling and…
(1/7) @CerebrasSystems Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
Great to be back from #ICLR2025 in Singapore, and super excited to have given my first oral presentation on influence functions for diffusion models!
We have a fantastic lineup of speakers who have made deep contributions to open-source in ML, e.g. @sarahookr, @ChrisRackauckas, @SingularMattrix, @tri_dao, @BlancheMinerva, and Evan Shelhamer!
Built a new ML library? Maintaining a crucial project? Improved OSS practices? Your work deserves recognition! Submit your contributions to the CODEML workshop @ #ICML2025. We're championing open-source in ML. 💻✨ Deadline May 19. codeml-workshop.github.io/codeml2025/
Why does Adam outperform SGD in LLM training? Adaptive step sizes alone don't fully explain this, as Adam also surpasses adaptive SGD. Is coordinate-wise adaptivity the secret? Not entirely: Adam actually struggles in the rotated parameter space! 🧵 (1/6) arxiv.org/abs/2410.08198
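A minimal sketch of the kind of rotation experiment the abstract points at (my own toy version, not the paper's code): run Adam on φ = Rθ for a fixed random orthogonal R, so coordinate-wise adaptivity no longer aligns with the original parameter axes.

```python
# Toy sketch of a "rotated Adam" experiment (my reading of the abstract, not
# the paper's code): apply Adam in a fixed, randomly rotated parameter basis.
import torch

def random_rotation(d):
    """Fixed orthogonal matrix R from the QR decomposition of a Gaussian."""
    Q, _ = torch.linalg.qr(torch.randn(d, d))
    return Q

def rotated_adam_step(theta, grad, R, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Adam on phi = R @ theta: rotate the gradient, do the usual
    coordinate-wise update, then rotate the step back to the original basis."""
    b1, b2 = betas
    g = R @ grad                                    # gradient w.r.t. phi
    if not state:
        state.update(m=torch.zeros_like(g), v=torch.zeros_like(g), t=0)
    state["t"] += 1
    state["m"].mul_(b1).add_(g, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat = state["m"] / (1 - b1 ** state["t"])     # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])     # bias-corrected second moment
    theta.add_(R.T @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)
```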