Enea Monzio Compagnoni @ Flexion Robotics
@EneaMC
PhD Student in Stochastic Optimization for Deep Learning @ the University of Basel. I smash calculations until nightfall. Past: UBS; Yahoo! Research.
We asked SDEs for wisdom. They said: ‘DSignSGD = Chad, DCSGD = Sad.’💀🔥 #Oral #AISTATS2025 Noise hits compression differently! 📉 DCSGD crumbles under large & heavy-tailed noise. 💪 DSignSGD? Still rocks. 📜 Scaling rules for Distributed Learning!👇 arxiv.org/abs/2502.17009
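For intuition, here is a minimal toy sketch (my own, not the paper's setup) contrasting a top-k-compressed distributed SGD step with a sign-compressed one on a quadratic, under heavy-tailed Student-t gradient noise. All names, compressors, and constants are illustrative assumptions.

# Toy sketch (illustrative only): compare a top-k-compressed distributed SGD
# step against a sign-compressed one on f(x) = 0.5 * ||x||^2, with
# heavy-tailed (Student-t, df=2) gradient noise on each worker.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, lr, steps = 50, 8, 0.05, 500

def top_k(g, k=5):                      # keep only the k largest-magnitude coordinates
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def run(compress):
    x = np.ones(d)
    for _ in range(steps):
        grads = [x + rng.standard_t(df=2, size=d) for _ in range(n_workers)]  # heavy-tailed noise
        x -= lr * np.mean([compress(g) for g in grads], axis=0)               # aggregate compressed grads
    return np.linalg.norm(x)

print("DCSGD-like (top-k):  ", run(top_k))    # tends to suffer under the outliers
print("DSignSGD-like (sign):", run(np.sign))  # per-coordinate step is bounded by lr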

Like @micahgoldblum and coauthors, we also found that small batches make SGD effective in LM training. It's cool that our papers came out around the same time, each with a different perspective! Below, our take on why this happens. Our awesome team: @teodorasrec @jonasgeiping…
Come to HiLD tomorrow @ICML2025! We have 4 posters on optimization:
- In Search of Adam’s Secret Sauce
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness…
To make things elegant (intuition will come later), let us talk about the SDE approximations of SGD and SignSGD (a good model for Adam). The SignSGD result is by @EneaMC (arxiv.org/abs/2411.15958), a must-read if you want to understand how adaptive methods react to noise.
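For context, the standard first-order SDE approximation of SGD (learning rate η, loss f, gradient-noise covariance Σ) is written below; the SignSGD SDE derived in the linked paper is more involved and depends explicitly on the noise distribution.

% First-order SDE approximation of SGD: learning rate \eta, loss f,
% gradient-noise covariance \Sigma(X_t), Brownian motion W_t.
\[
  dX_t = -\nabla f(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t .
\]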
🚀 Launch day! The NeurIPS 2025 PokéAgent Challenge is live. Two tracks: ① Showdown Battling – imperfect-info, turn-based strategy ② Pokemon Emerald Speedrunning – long horizon RPG planning 5 M labeled replays • starter kit • baselines. Bring your LLM, RL, or hybrid…
Pass by if you want to know about scaling up your model under distribution shifts in the training data. Takeaway: muP needs to be tuned to the amount of feature learning that optimizes the forgetting/plasticity trade-off.
🚨 Excited to present our new paper at 🇨🇦 #ICML2025! 🚨 "The Importance of Being Lazy: Scaling Limits of Continual Learning" Great collab with @alebreccia99, @glanzillo11 , Thomas Hofmann, @lorenzo_noci. 🧵 1/6
I’m pleased to share that, starting August 1, I will be joining MBZUAI (@mbzuai) as an Assistant Professor in the Department of Statistics and Data Science. My research focuses on optimization for machine learning, with an emphasis on stochastic methods, federated learning,…
I have 6 papers in my batch as a reviewer at @NeurIPSConf. I have reviewed 4 of them so far, and 3 of those have mistakes in the proofs… and the mistakes are usually easy to spot. At least one of the papers seems genuinely interesting to me, which is also rare.
If you are ever ablating on LM training, this is the ONLY codebase I trust, by the amazing Nico.
Great work with tons of ablations and a nice interpretation of Adam as an online variational inference method! And super proud they used plainLM to train "over 1,300 models across different data and scales" (: github.com/Niccolo-Ajrold…
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.…
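To pin down what "the beta parameters" refer to, here is a plain sketch of the standard Adam step (not the paper's variational reformulation): beta1 is the EMA rate for the gradient, beta2 the EMA rate for the squared gradient.

# Standard Adam step (sketch): beta1 smooths the gradient (momentum term),
# beta2 smooths the squared gradient (the variance-like term the variational
# view reinterprets). t is the step counter, starting at 1.
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2       # EMA of squared gradients
    m_hat = m / (1 - beta1**t)                  # bias corrections
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v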
We have a new SSM theory paper, just accepted to COLT, revisiting the recall properties of linear RNNs. It's surprising how deep one can go, and how beautiful it becomes. With (and only thanks to) the amazing Alexandre and @BachFrancis arxiv.org/pdf/2502.09287
Everybody gangsta until SDEs work! #ICLR2025 Noise hits every optimizer differently! For SignSGD, adaptivity increases resistance to gradient noise, while AdamW enjoys extreme stability. Plus, AdamW has an exciting new scaling rule! More below👇! arxiv.org/abs/2411.15958
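As a toy illustration of the resistance claim (my own example, not from the paper): the sign nonlinearity caps each coordinate's step at the learning rate, so a single heavy-tailed spike cannot blow up the update.

# One heavy-tailed gradient spike: plain SGD follows the outlier, while the
# SignSGD step is bounded by lr in every coordinate.
import numpy as np

lr = 0.1
g_outlier = np.array([0.3, -0.5, 1e6])            # one coordinate hit by a noise spike
print("SGD step:    ", -lr * g_outlier)           # blown up by the outlier
print("SignSGD step:", -lr * np.sign(g_outlier))  # each coordinate bounded by lr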
@orvieto_antonio congratulations on this afternoon's talk in Naples, which I followed on YouTube. Perhaps sooner or later we would be glad to hear you live at PLACEBO FOUNDATION in poor Lucania.
🚨 NEW PAPER DROP! Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL? We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12
🚀 Stronger performance, better privacy — no compromises! 📖 Check it out for more details! 🔗 arxiv.org/abs/2502.11682 Joint work with @sam_hrvth, @AurelienLucchi, @peter_richtarik, @ed_gorbunov