Tony S.F.
@tonysilveti
Asst. Prof. (maître de conférences) of artificial intelligence at @CentraleSupelec in the Centre pour la Vision Numérique. Bike commuter 🇲🇽/🇺🇸
We also provide the first convergence rate analysis that I'm aware of for stochastic unconstrained Frank-Wolfe (i.e., without weight decay), which directly covers the Muon optimizer (and much more)! Tagging people who might be interested: @jxbz @kellerjordan0 @YouJiacheng
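For readers who want the step written out: schematically (exact assumptions and constants are in the paper), the unconstrained update analyzed here is

```latex
% Schematic form of the unconstrained stochastic Frank-Wolfe step (simplified notation)
d_t     = (1-\alpha_t)\, d_{t-1} + \alpha_t \nabla f(w_t, \xi_t)   % momentum gradient estimate
w_{t+1} = w_t + \gamma_t\, \mathrm{lmo}(d_t),
\qquad \mathrm{lmo}(d) := \operatorname*{arg\,min}_{\|s\| \le \rho} \langle s, d \rangle
```

The constrained variant adds a $-\gamma_t w_t$ term, i.e. $w_{t+1} = (1-\gamma_t) w_t + \gamma_t\,\mathrm{lmo}(d_t)$, which is where weight decay comes from; with a spectral-norm ball the lmo output is a Muon-style orthogonalized direction.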
🔥 Want to train large neural networks WITHOUT Adam while using less memory and getting better results? ⚡ Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs): 🧵👇
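To make the LMO idea concrete, here is a minimal sketch of one step for a single weight matrix under a spectral-norm ball. This is an illustration in plain PyTorch, not the official Scion implementation; `rho`, `alpha`, and the function names are made up for the example.

```python
import torch

def spectral_lmo(d: torch.Tensor, rho: float) -> torch.Tensor:
    """LMO over the spectral-norm ball {S : ||S||_2 <= rho}:
    argmin_S <S, d> = -rho * U @ V^T, where d = U diag(s) V^T (reduced SVD)."""
    U, _, Vh = torch.linalg.svd(d, full_matrices=False)
    return -rho * U @ Vh

@torch.no_grad()
def lmo_step(W, grad, momentum, lr=0.1, alpha=0.1, rho=1.0, constrained=True):
    """One LMO-based update (sketch). constrained=True keeps W inside the norm ball
    and behaves like weight decay; constrained=False is the unconstrained variant,
    whose direction matches a Muon-style orthogonalized step."""
    momentum.mul_(1 - alpha).add_(grad, alpha=alpha)     # running gradient estimate d_t
    direction = spectral_lmo(momentum, rho)
    if constrained:
        W.mul_(1 - lr).add_(direction, alpha=lr)         # W <- (1-lr) W + lr * lmo(d)
    else:
        W.add_(direction, alpha=lr)                      # W <- W + lr * lmo(d)
    return W
```

Usage would look like `lmo_step(W, W.grad, momentum_buffer)` inside a training loop, with one momentum buffer per weight matrix.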
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
We'll present our work, "CHAMELEON: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning," at #ICML2025! This is joint work with Francesco Tonin and @CevherLIONS. 📍 Find us at Poster E-2807 from 11 AM today. Excited to connect and discuss!
I'm looking forward to presenting our work on "Training Deep Learning Models with Norm-Constrained LMOs" today from 11am at ICML, which is joint work with a bunch of incredible people: @WanyunXie Kimon Zhenyu @tonysilveti @CevherLIONS 👇
Training neural networks at any scale tutorial with @CevherLIONS and @leenaCvankadara #ICML2025
Excited to give a tutorial with @leenaCvankadara on Training Neural Networks at Any Scale (TRAINS) @icmlconf at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS Please join us.
Excited to announce our recent work on low-precision deep learning via biologically-inspired noisy log-normal multiplicative dynamics (LMD). It allows us to train large neural nets (such as GPT-2 and ViT) in FP6. arxiv.org/abs/2506.17768
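Rough intuition for the name (generic background only, not the paper's actual LMD update, which is at the link): a "log-normal multiplicative" step scales each weight by exp(Gaussian), so the multiplier is log-normally distributed. A toy sketch of that generic idea, with made-up `lr` and `sigma` names:

```python
import torch

def toy_multiplicative_step(w, grad, lr=1e-2, sigma=1e-2):
    """Toy multiplicative update with a log-normal factor (illustration only,
    NOT the LMD algorithm from arxiv.org/abs/2506.17768).
    Drift is the gradient w.r.t. log|w| (which equals grad * w), plus Gaussian
    noise in log-space, so the overall multiplier exp(.) is log-normal."""
    drift = -lr * grad * w                 # descent on the log-magnitude
    noise = sigma * torch.randn_like(w)    # Gaussian in log-space
    return w * torch.exp(drift + noise)    # sign of w is preserved
```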
This is an attention variant worth looking at! It seems to be inspired by prior work of @danielmurfet that wasn't explored much in practice; I wonder what other gems are just waiting to be scaled up.
@tonysilveti on the LMO and spectral methods comparison: Frank-Wolfe, stochastic gradient descent, and Adam meet after a decade @icmlconf github.com/LIONS-EPFL/sci…
This is happening today, 6pm Paris time!
Don't forget to join us tomorrow, July 3rd as we host @tonysilveti for a session on "Training neural networks at any scale" Learn more: cohere.com/events/Cohere-…
Join our ML Theory group next week as they welcome @tonysilveti on July 3rd for a presentation on "Training neural networks at any scale" Thanks to @itsmaddox_j @aniervs and @ThangChu77 for organizing this session 👏 Learn more: cohere.com/events/Cohere-…
An excellent point and one that I make often when I am talking about Scion - making the optimizer aware of the architecture it's optimizing is key to things like hyperparameter transfer. We can see this also in µP, which uses per-layer scalings of the learning rate based on dim.
One thing that feels strange about AdamW is that it treats all network parameters identically - norm layers, attention, and dense layers all get the same update rule. Classical optimization, in contrast, uses tricks such as mirror descent with a tailored mirror map to significantly…
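One concrete (and simplified) version of "per-layer scalings of the learning rate based on dim": group parameters by role and shrink the LR of hidden weight matrices with their fan-in, roughly in the µP spirit. The `ref_width` constant and the "embed" heuristic below are illustrative, not the exact µP prescription:

```python
import torch

def mup_style_param_groups(model, base_lr=3e-4, ref_width=256):
    """Rough µP-flavoured grouping (sketch, not the exact µP rules):
    hidden weight matrices get lr scaled by ref_width / fan_in, while
    embeddings, norms and biases keep the base lr."""
    groups, vectors = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:   # crude heuristic for "hidden matrix"
            groups.append({"params": [p], "lr": base_lr * ref_width / p.shape[1]})
        else:
            vectors.append(p)
    groups.append({"params": vectors, "lr": base_lr})
    return groups

# e.g. optimizer = torch.optim.AdamW(mup_style_param_groups(model), weight_decay=0.1)
```

The point is the same as in the tweet above: the optimizer (or its hyperparameters) should know what kind of layer each tensor belongs to, which is what makes hyperparameter transfer across widths possible.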
Whether or not you should cite a related work has *nothing* to do with whether or not the authors "want your citation", actually.
There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation
Is there a mathematical version of this, where we start to write our papers for the LLM to digest rather than for a human reader?
Slowly people are starting to understand that it is ridiculous to optimize for human ergonomics instead of creating an AI first programming ecosystem. Read my manifesto here: drive.google.com/file/d/1KbwH2z…