Tony S.F.
@tonysilveti
Asst. Prof. (maître de conférences) of artificial intelligence at @CentraleSupelec in the Centre pour la Vision Numérique. Bike commuter 🇲🇽/🇺🇸
We also provide the first convergence rate analysis that I'm aware of for stochastic unconstrained Frank-Wolfe (i.e., without weight decay), which directly covers the Muon optimizer (and much more)! Tagging people who might be interested: @jxbz @kellerjordan0 @YouJiacheng
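For readers who want the step written out: schematically (exact assumptions and constants are in the paper), the unconstrained update analyzed here is

```latex
% Schematic form of the unconstrained stochastic Frank-Wolfe step (simplified notation)
d_t     = (1-\alpha_t)\, d_{t-1} + \alpha_t \nabla f(w_t, \xi_t)   % momentum gradient estimate
w_{t+1} = w_t + \gamma_t\, \mathrm{lmo}(d_t),
\qquad \mathrm{lmo}(d) := \operatorname*{arg\,min}_{\|s\| \le \rho} \langle s, d \rangle
```

The constrained variant adds a $-\gamma_t w_t$ term, i.e. $w_{t+1} = (1-\gamma_t) w_t + \gamma_t\,\mathrm{lmo}(d_t)$, which is where weight decay comes from; with a spectral-norm ball the lmo output is a Muon-style orthogonalized direction.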
🔥 Want to train large neural networks WITHOUT Adam while using less memory and getting better results? ⚡ Check out SCION: a new optimizer that adapts to the geometry of your problem using norm-constrained linear minimization oracles (LMOs): 🧵👇
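To make the LMO idea concrete, here is a minimal sketch of one step for a single weight matrix under a spectral-norm ball. This is an illustration in plain PyTorch, not the official Scion implementation; `rho`, `alpha`, and the function names are made up for the example.

```python
import torch

def spectral_lmo(d: torch.Tensor, rho: float) -> torch.Tensor:
    """LMO over the spectral-norm ball {S : ||S||_2 <= rho}:
    argmin_S <S, d> = -rho * U @ V^T, where d = U diag(s) V^T (reduced SVD)."""
    U, _, Vh = torch.linalg.svd(d, full_matrices=False)
    return -rho * U @ Vh

@torch.no_grad()
def lmo_step(W, grad, momentum, lr=0.1, alpha=0.1, rho=1.0, constrained=True):
    """One LMO-based update (sketch). constrained=True keeps W inside the norm ball
    and behaves like weight decay; constrained=False is the unconstrained variant,
    whose direction matches a Muon-style orthogonalized step."""
    momentum.mul_(1 - alpha).add_(grad, alpha=alpha)     # running gradient estimate d_t
    direction = spectral_lmo(momentum, rho)
    if constrained:
        W.mul_(1 - lr).add_(direction, alpha=lr)         # W <- (1-lr) W + lr * lmo(d)
    else:
        W.add_(direction, alpha=lr)                      # W <- W + lr * lmo(d)
    return W
```

Usage would look like `lmo_step(W, W.grad, momentum_buffer)` inside a training loop, with one momentum buffer per weight matrix.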
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
We'll present our work, "CHAMELEON: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning," at #ICML2025! This is joint work with Francesco Tonin and @CevherLIONS. 📍 Find us at Poster E-2807 from 11 AM today. Excited to connect and discuss!
I'm looking forward to presenting our work on "Training Deep Learning Models with Norm-Constrained LMOs" today from 11am at ICML, which is joint work with a bunch of incredible people: @WanyunXie Kimon Zhenyu @tonysilveti @CevherLIONS 👇
Training neural networks at any scale tutorial with @CevherLIONS and @leenaCvankadara #ICML2025
Excited to give a tutorial with @leenaCvankadara on Training Neural Networks at Any Scale (TRAINS) @icmlconf at 13:30 (West Ballroom A). Our slides can be found here: go.epfl.ch/ICML25TRAINS Please join us.
Excited to announce our recent work on low-precision deep learning via biologically-inspired noisy log-normal multiplicative dynamics (LMD). It allows us to train large neural nets (such as GPT-2 and ViT) in FP6. arxiv.org/abs/2506.17768
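Rough intuition for the name (generic background only, not the paper's actual LMD update, which is at the link): a "log-normal multiplicative" step scales each weight by exp(Gaussian), so the multiplier is log-normally distributed. A toy sketch of that generic idea, with made-up `lr` and `sigma` names:

```python
import torch

def toy_multiplicative_step(w, grad, lr=1e-2, sigma=1e-2):
    """Toy multiplicative update with a log-normal factor (illustration only,
    NOT the LMD algorithm from arxiv.org/abs/2506.17768).
    Drift is the gradient w.r.t. log|w| (which equals grad * w), plus Gaussian
    noise in log-space, so the overall multiplier exp(.) is log-normal."""
    drift = -lr * grad * w                 # descent on the log-magnitude
    noise = sigma * torch.randn_like(w)    # Gaussian in log-space
    return w * torch.exp(drift + noise)    # sign of w is preserved
```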
This is an attention variant worth looking at! It seems to be inspired by prior work of @danielmurfet that wasn't explored much in practice; I wonder what other gems are just waiting to be scaled up.
@tonysilveti on the LMO and spectral methods comparison: Frank-Wolfe, stochastic gradient descent, and Adam meet after a decade @icmlconf github.com/LIONS-EPFL/sci…
This is happening today, 6pm Paris time!
Don't forget to join us tomorrow, July 3rd as we host @tonysilveti for a session on "Training neural networks at any scale" Learn more: cohere.com/events/Cohere-…
Join our ML Theory group next week as they welcome @tonysilveti on July 3rd for a presentation on "Training neural networks at any scale" Thanks to @itsmaddox_j @aniervs and @ThangChu77 for organizing this session 👏 Learn more: cohere.com/events/Cohere-…
An excellent point and one that I make often when I am talking about Scion - making the optimizer aware of the architecture it's optimizing is key to things like hyperparameter transfer. We can see this also in µP, which uses per-layer scalings of the learning rate based on dim.
One thing that feels strange about AdamW is that it treats all network parameters identically - norm layers, attention, and dense layers all get the same update rule. Classical optimization, in contrast, uses tricks such as mirror descent with a tailored mirror map to significantly…
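One concrete (and simplified) version of "per-layer scalings of the learning rate based on dim": group parameters by role and shrink the LR of hidden weight matrices with their fan-in, roughly in the µP spirit. The `ref_width` constant and the "embed" heuristic below are illustrative, not the exact µP prescription:

```python
import torch

def mup_style_param_groups(model, base_lr=3e-4, ref_width=256):
    """Rough µP-flavoured grouping (sketch, not the exact µP rules):
    hidden weight matrices get lr scaled by ref_width / fan_in, while
    embeddings, norms and biases keep the base lr."""
    groups, vectors = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:   # crude heuristic for "hidden matrix"
            groups.append({"params": [p], "lr": base_lr * ref_width / p.shape[1]})
        else:
            vectors.append(p)
    groups.append({"params": vectors, "lr": base_lr})
    return groups

# e.g. optimizer = torch.optim.AdamW(mup_style_param_groups(model), weight_decay=0.1)
```

The point is the same as in the tweet above: the optimizer (or its hyperparameters) should know what kind of layer each tensor belongs to, which is what makes hyperparameter transfer across widths possible.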
Whether or not you should cite a related work has *nothing* to do with whether or not the authors "want your citation", actually.
There have been hundreds of optimizer papers published. But the SOTA has only improved a few times. Therefore we can conclude that almost all optimizer papers are fake. If you're gonna write another fake optimizer paper, please don't cite Muon. I don't want your citation
Is there a mathematical version of this, where we start to write our papers for the LLM to digest rather than for a human reader?
Slowly people are starting to understand that it is ridiculous to optimize for human ergonomics instead of creating an AI first programming ecosystem. Read my manifesto here: drive.google.com/file/d/1KbwH2z…