Kwangjun Ahn
@KwangjunA
Senior Researcher at Microsoft Research // PhD from MIT EECS
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
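For context, the update at the heart of Muon is easy to state: take the momentum buffer of a 2D weight matrix and approximately orthogonalize it with a few Newton–Schulz iterations before applying it. Below is a minimal PyTorch-style sketch; the iteration count, polynomial coefficients, and function names follow the commonly circulated open-source implementation and may differ from the annotated code in the blog posts.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix."""
    # Coefficients as in the widely used open-source Muon code (assumption: may vary).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work in the "fat" orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # odd polynomial pushing singular values toward 1
    return X.T if transposed else X

def muon_step(weight: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a 2D weight matrix (sketch, not a drop-in optimizer)."""
    momentum_buf.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalized momentum
    weight.add_(update, alpha=-lr)
```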
Apparently Dion is now being worked on for TorchTitan: github.com/pytorch/torcht… :-)
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
But actually this is the OG way of doing it: you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.
ICML: come check out our Oral Presentation on Schedule-Free training theory, built on an elegant online-learning framework!
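For readers new to the method, the Schedule-Free step described a couple of posts up fits in a few lines. Below is a minimal sketch of Schedule-Free SGD in the spirit of the Defazio et al. recipe; the step size, interpolation weight, and uniform averaging are illustrative choices, and warmup and weight decay are omitted.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=1000):
    """Minimal Schedule-Free SGD sketch: no LR schedule; average iterates (x),
    and take gradients at an interpolation (y) between the average and the base iterate (z)."""
    z = x0.copy()          # "base" SGD iterate
    x = x0.copy()          # running average: this is what you evaluate/deploy
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradient is queried at the interpolated point
        z = z - lr * grad_fn(y)            # plain SGD step on the base iterate
        c = 1.0 / t                        # uniform averaging weight
        x = (1 - c) * x + c * z            # online average of the z iterates
    return x

# Toy usage: minimize a quadratic f(w) = 0.5 * ||A w - b||^2 (illustrative only).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = schedule_free_sgd(lambda w: A.T @ (A @ w - b), x0=np.zeros(2))
```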

and also Dion by @KwangjunA, @JohnCLangford et al arxiv.org/abs/2504.05295
ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: github.com/microsoft/BST)
The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and @mgostIH for further discussion.
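To make "compact belief state" a bit more concrete: my reading of the objective is that a forward encoder summarizes the prefix, a backward encoder summarizes the suffix, and a shared head is trained to predict both the next token after the prefix and the previous token before the suffix. The sketch below is only a schematic of that objective; the GRU encoders are stand-ins for the real transformer encoders, and all module and variable names are hypothetical, so see the paper and website above for the actual formulation.

```python
import torch
import torch.nn as nn

class BeliefStateSketch(nn.Module):
    """Schematic prefix/suffix objective in the spirit of the BST write-up.
    Stand-in GRU encoders; the actual model is transformer-based (assumption)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fwd_enc = nn.GRU(dim, dim, batch_first=True)   # reads the prefix left-to-right
        self.bwd_enc = nn.GRU(dim, dim, batch_first=True)   # reads the suffix right-to-left
        self.next_head = nn.Linear(2 * dim, vocab_size)     # predicts the token after the prefix
        self.prev_head = nn.Linear(2 * dim, vocab_size)     # predicts the token before the suffix

    def forward(self, prefix: torch.Tensor, suffix: torch.Tensor):
        _, f = self.fwd_enc(self.embed(prefix))                  # forward summary of the past
        _, b = self.bwd_enc(self.embed(suffix.flip(dims=[1])))   # backward summary of the future
        belief = torch.cat([f[-1], b[-1]], dim=-1)               # the "belief state"
        return self.next_head(belief), self.prev_head(belief)

# Toy usage with placeholder targets (real targets would be the actual boundary tokens).
model = BeliefStateSketch(vocab_size=100)
prefix = torch.randint(0, 100, (4, 10))   # batch of 4 prefixes, length 10
suffix = torch.randint(0, 100, (4, 10))
next_logits, prev_logits = model(prefix, suffix)
loss = nn.functional.cross_entropy(next_logits, torch.randint(0, 100, (4,))) \
     + nn.functional.cross_entropy(prev_logits, torch.randint(0, 100, (4,)))
```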
New reqs for low- to high-level researcher positions: jobs.careers.microsoft.com/global/en/job/… , jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, plus postdoc openings with Akshay and @MiroDudik. Please apply, or pass them along to anyone who might be interested :-)
Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce , @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.
Come to my presentation of our ICML 2024 paper tomorrow at 1:30–3 pm! We provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)

What's the optimal optimizer? New work comparing (diagonally conditioned) first-order methods.
NEW #KempnerInstitute blog: @rosieyzh, @depen_morwani, @brandfonbrener, @vyasnikhil96 & @ShamKakade6 study a variety of #LLM training optimizers and find they are all fairly similar except for SGD, which is notably worse. Read more: bit.ly/3S5PmZk #ML #AI
In our ICML 2024 paper (@icmlconf), joint work with Zhiyu Zhang (@imZhiyuZ), Yunbum Kook, and Yan Dai, we provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)
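For reference, the "key components" here are the two exponential moving averages, the normalization, and the bias correction in the textbook Adam step, sketched below. This is the standard update for orientation only, not the paper's online-learning reduction; the hyperparameters are the usual defaults.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step (for reference only; not the paper's online-learning view)."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # EMA of gradients ("momentum")
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # normalized update
```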

I successfully defended my thesis at MIT EECS yesterday! A huge thank you to my advisors, Suvrit and Ali, and my committee member Ashia! The thesis covers my recent work on Transformers and Adam. For those who are interested, check out the video: youtu.be/5rgrB7TGPdc

Exciting new paper by Kwangjun Ahn (@KwangjunA) and Ashok Cutkosky (@AshokCutkosky)! Adam with model exponential moving average is effective for nonconvex optimization arxiv.org/pdf/2405.18199 This approach to analyzing Adam is extremely promising IMHO.
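Concretely, "Adam with model exponential moving average" means keeping a slowly updated EMA copy of the weights alongside the usual Adam training and evaluating that copy. A minimal sketch, with an illustrative decay value and a placeholder training loop rather than the paper's exact setup:

```python
import copy
import torch

@torch.no_grad()
def update_model_ema(model: torch.nn.Module, ema_model: torch.nn.Module, decay: float = 0.999):
    """Blend the EMA weights toward the current weights after each optimizer step."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Illustrative training loop (the model, data, and loss are placeholders).
model = torch.nn.Linear(10, 1)
ema_model = copy.deepcopy(model)          # the averaged model used for evaluation
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, y in [(torch.randn(8, 10), torch.randn(8, 1))]:   # stand-in for a real data loader
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    update_model_ema(model, ema_model)    # train `model`, evaluate/deploy `ema_model`
```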
If you're at #NeurIPS2023, @KwangjunA will be presenting his work on SpecTr++ at the Optimal Transport workshop, where he discusses improved transport plans for speculative decoding.
[Today 5pm poster 401 #NeurIPS2023] Is your LLM inference too slow? We achieve a 2.13x wall-clock speedup when sampling from SOTA LLMs with 𝐩𝐫𝐨𝐯𝐚𝐛𝐥𝐲 no quality sacrifice. How? We use a cheap LM to draft 𝐦𝐮𝐥𝐭𝐢𝐩𝐥𝐞 samples, which are then scored with the LLM to accept/reject tokens.
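For anyone new to the draft-and-verify idea behind this line of work, below is the basic single-draft accept/reject rule for speculative sampling (not the SpecTr multi-draft transport plan itself); here p and q are the target-LLM and draft-LM next-token distributions, and the function name is illustrative.

```python
import numpy as np

def verify_draft_token(token: int, p: np.ndarray, q: np.ndarray, rng: np.random.Generator):
    """Basic speculative-sampling acceptance rule (single draft, for illustration only).

    p: target-LLM next-token distribution; q: draft-LM distribution the token was sampled from.
    Accepting with prob min(1, p/q) and resampling from the residual keeps samples
    distributed exactly according to p, hence no quality loss."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token                                  # keep the cheap draft token
    residual = np.maximum(p - q, 0.0)                 # otherwise resample from the leftover mass
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)
```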
Check out Xiang Cheng's talk on our linear transformer work, given at the Simons Institute!! youtube.com/live/PnwC74s1n…