Aaron Defazio
@aaron_defazio
Research Scientist at Meta working on optimization. Fundamental AI Research (FAIR) team
Schedule-Free Wins AlgoPerf Self-Tuning Track 🎉 I'm pleased to announce that Schedule-Free AdamW set a new SOTA for self-tuning training algorithms, besting AdamW and all other submissions by 8% overall. Try it out: github.com/facebookresear…
@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI
having been burned by this a few times already in my life, I'm cautious, but... maybe RL is gonna work this time?
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform! francisbach.com/z-transform/
Our Goedel-Prover-V2 doubled the SOTA Pass@32 performance on PutnamBench with a 20x smaller model, making it the strongest open-source theorem prover to date!
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
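A minimal sketch of the mechanism described above, assuming the standard Schedule-Free SGD form (gradient evaluated at an interpolated point y, base step on z, running average x); the names and the simple 1/t averaging weight are illustrative, not the reference implementation:

```python
# One Schedule-Free SGD step (sketch only).
def schedule_free_sgd_step(x, z, t, grad_fn, lr=1.0, beta=0.9):
    y = (1 - beta) * z + beta * x   # gradient is computed at the interpolated point
    g = grad_fn(y)
    z = z - lr * g                  # plain SGD step on the base iterate
    c = 1.0 / (t + 1)               # running-average weight; no decay schedule needed
    x = (1 - c) * x + c * z         # x is the average of the z iterates
    return x, z
```

The model is evaluated at the averaged iterate x, which is why no separate decay phase has to be scheduled.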
ICML: come check out our Oral Presentation on Schedule-Free training theory, based on an elegant online-learning analysis!
For the record, we use a WSD learning rate schedule. The sudden drop in loss at around 11T tokens is just the learning rate starting to decay.
Since this figure is going around, the sudden drop corresponds to a very standard learning rate decay and has nothing to do with MuonClip. The point of this figure in the context of MuonClip was to show how stable the training was at such large scales
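For context, WSD is the warmup-stable-decay schedule: hold the peak learning rate for most of training, then decay near the end, which is exactly where the loss drop in the figure appears. A minimal sketch with illustrative breakpoints and a linear decay for simplicity (not the actual values or decay shape used in that run):

```python
# Illustrative WSD (warmup-stable-decay) learning rate schedule.
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # long stable phase at the peak LR
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac    # decay to min_lr at the end
```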
late 1980s, @ylecun and @LeonBottou used an amiga 1000 and a bespoke modem to implement and research artificial neural nets using SN-1. the legend was born.
Very nice plots, nice poster design, and nice work overall:
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
Also, I think whether SGD is "competitive" is a fairly subjective call? In your figure, a validation-loss gap of ~0.2 is considered slightly worse or competitive, but in arxiv.org/abs/2407.07972 (Reference 18 in your paper), this gap is considered significant.
🤔
But if AI is on an exponential curve, then every point on the curve should be exactly as wild as every other point, as Euler (of course) proved.
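The property behind the quip, spelled out: an exponential has a constant relative growth rate, so every point on the curve looks like every other point up to rescaling (elementary calculus rather than anything Euler specifically proved, which is part of the joke):

$f(t) = e^{ct} \implies \frac{f(t+s)}{f(t)} = e^{cs}$, independent of $t$.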
two or three years ago, this was a prevailing sentiment “training the largest LLMs is very hard. only a few people know how. everyone else is failing” it was hard to avoid huge loss spikes. now pretraining is a solved problem. what changed? do we just clean our data better?
There is a lot of research alpha in looking at why things don’t work (when it’s plausible they should). Much better than looking at things that we know work and trying to figure out why.
I am pleased to share my new paper (joint with Yuchao Li) on Error Bounds for Aggregation Methods. web.mit.edu/dimitrib/www/A… In my view, aggregation is an under-appreciated off-line training approach in #reinforcementlearning
AdamC results have replicated. C-AdamW results have replicated (and it tops the Heavyball charts). Has C-AdamC been tried yet?
While we were running WSD cooldowns for the marin.community project, this gradient increase led to problematic loss ascent. We patched it with Z-loss, but AdamC feels better™️. So over the weekend, I ran 4 experiments (130M to 1.4B params), all at ~compute-optimal token counts...🧵
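For readers who haven't seen it, the Z-loss mentioned here is (I assume) the auxiliary logit-normalizer penalty popularized by PaLM, which keeps the softmax partition function from drifting. A minimal sketch with an illustrative coefficient:

```python
import torch

# Auxiliary z-loss: penalize log(Z)^2, where Z is the softmax normalizer,
# to discourage runaway logit growth (added on top of the usual cross-entropy).
def z_loss(logits, coeff=1e-4):
    log_z = torch.logsumexp(logits, dim=-1)  # log of the partition function per token
    return coeff * (log_z ** 2).mean()
```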
New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread.🧵
Nice independent replication of AdamC! It mitigates gradient norm issues and even shows improved loss.
TL;DR: at 3 of our 4 scales, the AdamC results reproduce out of the box! With all other factors held constant, AdamC mitigates the gradient-norm increase at the end of training and leads to an overall lower loss (-0.04)!
I took this class. Good times! Thank you @ZoubinGhahrama1 and @geoffreyhinton!! Yeah, no backprop. I view this as more the "modeling phase" of deep learning vs. the "scale" phase. I'm going with: the ideas are still relevant for AI4science.
This is the syllabus of the course @geoffreyhinton and I taught in 1998 at the Gatsby Unit (just after it was founded). Notice anything?