Aaron Defazio
@aaron_defazio
Research Scientist at Meta working on optimization. Fundamental AI Research (FAIR) team
Schedule-Free Wins AlgoPerf Self-Tuning Track 🎉 I'm pleased to announce that Schedule-Free AdamW set a new SOTA for self-tuning training algorithms, besting AdamW and all other submissions by 8% overall. Try it out: github.com/facebookresear…
@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI
having been burned by this a few times already in my life, I'm cautious, but... maybe RL is gonna work this time?
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
Our Goedel-Prover-V2 doubled the SOTA Pass@32 performance on PutnamBench with a 20x smaller model, making it the strongest open-source theorem prover to date!
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
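A minimal sketch of the mechanism described above, assuming the standard Schedule-Free SGD form (gradient evaluated at an interpolated point y, base step on z, running average x); the names and the simple 1/t averaging weight are illustrative, not the reference implementation:

```python
# One Schedule-Free SGD step (sketch only).
def schedule_free_sgd_step(x, z, t, grad_fn, lr=1.0, beta=0.9):
    y = (1 - beta) * z + beta * x   # gradient is computed at the interpolated point
    g = grad_fn(y)
    z = z - lr * g                  # plain SGD step on the base iterate
    c = 1.0 / (t + 1)               # running-average weight; no decay schedule needed
    x = (1 - c) * x + c * z         # x is the average of the z iterates
    return x, z
```

The model is evaluated at the averaged iterate x, which is why no separate decay phase has to be scheduled.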
ICML: come check out our Oral Presentation on Schedule-Free training theory, based on an elegant online-learning analysis!
For the record, we use a WSD learning rate schedule. The sudden drop in loss at around 11T tokens is just the learning rate starting to decay.
Since this figure is going around, the sudden drop corresponds to a very standard learning rate decay and has nothing to do with MuonClip. The point of this figure in the context of MuonClip was to show how stable the training was at such large scales
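For context, WSD is the warmup-stable-decay schedule: hold the peak learning rate for most of training, then decay near the end, which is exactly where the loss drop in the figure appears. A minimal sketch with illustrative breakpoints and a linear decay for simplicity (not the actual values or decay shape used in that run):

```python
# Illustrative WSD (warmup-stable-decay) learning rate schedule.
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # long stable phase at the peak LR
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac    # decay to min_lr at the end
```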
late 1980s, @ylecun and @LeonBottou used an amiga 1000 and a bespoke modem to implement and research artificial neural nets using SN-1. the legend was born.
Very nice plots, nice poster design, and nice work overall:
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
Also, I think whether SGD is "competitive" is a fairly subjective call? In your figure, a validation-loss gap of ~0.2 is considered slightly worse or competitive, but in arxiv.org/abs/2407.07972 (Reference 18 in your paper), this gap is considered significant.
🤔
But if AI is on an exponential curve, then every point on the curve should be exactly as wild as every other point, as Euler (of course) proved.
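The property behind the quip, spelled out: an exponential has a constant relative growth rate, so every point on the curve looks like every other point up to rescaling (elementary calculus rather than anything Euler specifically proved, which is part of the joke):

$f(t) = e^{ct} \implies \frac{f(t+s)}{f(t)} = e^{cs}$, independent of $t$.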
two or three years ago, this was a prevailing sentiment “training the largest LLMs is very hard. only a few people know how. everyone else is failing” it was hard to avoid huge loss spikes. now pretraining is a solved problem. what changed? do we just clean our data better?
There is a lot of research alpha in looking at why things don’t work (when it’s plausible they should). Much better than looking at things that we know work and trying to figure out why.
I am pleased to share my new paper (joint with Yuchao Li) on Error Bounds for Aggregation Methods. web.mit.edu/dimitrib/www/A… In my view, aggregation is an under-appreciated off-line training approach in #reinforcementlearning
AdamC results have replicated. C-AdamW results have replicated (and it tops the Heavyball charts). Has C-AdamC been tried yet?
While we were running WSD cooldowns for the marin.community project, this gradient increase led to problematic loss ascent. We patched it with Z-loss, but AdamC feels better™️. So over the weekend, I ran 4 experiments (130M to 1.4B params), all at ~compute-optimal token counts...🧵
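For readers who haven't seen it, the Z-loss mentioned here is (I assume) the auxiliary logit-normalizer penalty popularized by PaLM, which keeps the softmax partition function from drifting. A minimal sketch with an illustrative coefficient:

```python
import torch

# Auxiliary z-loss: penalize log(Z)^2, where Z is the softmax normalizer,
# to discourage runaway logit growth (added on top of the usual cross-entropy).
def z_loss(logits, coeff=1e-4):
    log_z = torch.logsumexp(logits, dim=-1)  # log of the partition function per token
    return coeff * (log_z ** 2).mean()
```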
New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread.🧵
Nice independent replication of AdamC! It mitigates gradient norm issues and even shows improved loss.
TL;DR: at 3 of our 4 scales, the AdamC results reproduce out of the box! With all other factors held constant, AdamC mitigates the gradient-norm increase at the end of training and leads to an overall lower loss (-0.04)!
I took this class. Good times! Thank you @ZoubinGhahrama1 and @geoffreyhinton!! Yeah, no backprop. I view this as more the "modeling phase" of deep learning vs. the "scale" phase. I'm going with: the ideas are still relevant for AI4science.
This is the syllabus of the course @geoffreyhinton and I taught in 1998 at the Gatsby Unit (just after it was founded). Notice anything?