Kwangjun Ahn
@KwangjunA
Senior Researcher at Microsoft Research // PhD from MIT EECS
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
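For context, the update at the heart of Muon is easy to state: take the momentum buffer of a 2D weight matrix and approximately orthogonalize it with a few Newton–Schulz iterations before applying it. Below is a minimal PyTorch-style sketch; the iteration count, polynomial coefficients, and function names follow the commonly circulated open-source implementation and may differ from the annotated code in the blog posts.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map G to the nearest (semi-)orthogonal matrix."""
    # Coefficients as in the widely used open-source Muon code (assumption: may vary).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work in the "fat" orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # odd polynomial pushing singular values toward 1
    return X.T if transposed else X

def muon_step(weight: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a 2D weight matrix (sketch, not a drop-in optimizer)."""
    momentum_buf.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalized momentum
    weight.add_(update, alpha=-lr)
```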
Apparently Dion is now being worked on for TorchTitan: github.com/pytorch/torcht… :-)
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
But actually this is the OG way of doing it: you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.
ICML: come check out our Oral Presentation on Schedule-Free training theory, built on an elegant online-learning framework!
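For readers new to the method, the Schedule-Free step described a couple of posts up fits in a few lines. Below is a minimal sketch of Schedule-Free SGD in the spirit of the Defazio et al. recipe; the step size, interpolation weight, and uniform averaging are illustrative choices, and warmup and weight decay are omitted.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=1000):
    """Minimal Schedule-Free SGD sketch: no LR schedule; average iterates (x),
    and take gradients at an interpolation (y) between the average and the base iterate (z)."""
    z = x0.copy()          # "base" SGD iterate
    x = x0.copy()          # running average: this is what you evaluate/deploy
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradient is queried at the interpolated point
        z = z - lr * grad_fn(y)            # plain SGD step on the base iterate
        c = 1.0 / t                        # uniform averaging weight
        x = (1 - c) * x + c * z            # online average of the z iterates
    return x

# Toy usage: minimize a quadratic f(w) = 0.5 * ||A w - b||^2 (illustrative only).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = schedule_free_sgd(lambda w: A.T @ (A @ w - b), x0=np.zeros(2))
```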

and also Dion by @KwangjunA, @JohnCLangford et al arxiv.org/abs/2504.05295
ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: github.com/microsoft/BST)
The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and @mgostIH for further discussion.
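To make "compact belief state" a bit more concrete: my reading of the objective is that a forward encoder summarizes the prefix, a backward encoder summarizes the suffix, and a shared head is trained to predict both the next token after the prefix and the previous token before the suffix. The sketch below is only a schematic of that objective; the GRU encoders are stand-ins for the real transformer encoders, and all module and variable names are hypothetical, so see the paper and website above for the actual formulation.

```python
import torch
import torch.nn as nn

class BeliefStateSketch(nn.Module):
    """Schematic prefix/suffix objective in the spirit of the BST write-up.
    Stand-in GRU encoders; the actual model is transformer-based (assumption)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fwd_enc = nn.GRU(dim, dim, batch_first=True)   # reads the prefix left-to-right
        self.bwd_enc = nn.GRU(dim, dim, batch_first=True)   # reads the suffix right-to-left
        self.next_head = nn.Linear(2 * dim, vocab_size)     # predicts the token after the prefix
        self.prev_head = nn.Linear(2 * dim, vocab_size)     # predicts the token before the suffix

    def forward(self, prefix: torch.Tensor, suffix: torch.Tensor):
        _, f = self.fwd_enc(self.embed(prefix))                  # forward summary of the past
        _, b = self.bwd_enc(self.embed(suffix.flip(dims=[1])))   # backward summary of the future
        belief = torch.cat([f[-1], b[-1]], dim=-1)               # the "belief state"
        return self.next_head(belief), self.prev_head(belief)

# Toy usage with placeholder targets (real targets would be the actual boundary tokens).
model = BeliefStateSketch(vocab_size=100)
prefix = torch.randint(0, 100, (4, 10))   # batch of 4 prefixes, length 10
suffix = torch.randint(0, 100, (4, 10))
next_logits, prev_logits = model(prefix, suffix)
loss = nn.functional.cross_entropy(next_logits, torch.randint(0, 100, (4,))) \
     + nn.functional.cross_entropy(prev_logits, torch.randint(0, 100, (4,)))
```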
New reqs for low- to high-level researcher positions: jobs.careers.microsoft.com/global/en/job/… , jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, plus postdoc openings with Akshay and @MiroDudik. Please apply, or pass them along to anyone who might be interested :-)
Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce , @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.
Come to my presentation of our ICML 2024 paper tomorrow at 1:30–3 pm! We provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)

What's the optimal optimizer? New work comparing (diagonally conditioned) first-order methods.
NEW #KempnerInstitute blog: @rosieyzh, @depen_morwani, @brandfonbrener, @vyasnikhil96 & @ShamKakade6 study a variety of #LLM training optimizers and find they are all fairly similar except for SGD, which is notably worse. Read more: bit.ly/3S5PmZk #ML #AI
In our ICML 2024 paper (@icmlconf), joint work with Zhiyu Zhang (@imZhiyuZ), Yunbum Kook, and Yan Dai, we provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)
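For reference, the "key components" here are the two exponential moving averages, the normalization, and the bias correction in the textbook Adam step, sketched below. This is the standard update for orientation only, not the paper's online-learning reduction; the hyperparameters are the usual defaults.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step (for reference only; not the paper's online-learning view)."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # EMA of gradients ("momentum")
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # normalized update
```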

I successfully defended my thesis at MIT EECS yesterday! A huge thank you to my advisors, Suvrit and Ali, and my committee member Ashia! The thesis covers my recent work on Transformers and Adam. For those who are interested, check out the video: youtu.be/5rgrB7TGPdc

Exciting new paper by Kwangjun Ahn (@KwangjunA) and Ashok Cutkosky (@AshokCutkosky)! Adam with model exponential moving average is effective for nonconvex optimization arxiv.org/pdf/2405.18199 This approach to analyzing Adam is extremely promising IMHO.
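Concretely, "Adam with model exponential moving average" means keeping a slowly updated EMA copy of the weights alongside the usual Adam training and evaluating that copy. A minimal sketch, with an illustrative decay value and a placeholder training loop rather than the paper's exact setup:

```python
import copy
import torch

@torch.no_grad()
def update_model_ema(model: torch.nn.Module, ema_model: torch.nn.Module, decay: float = 0.999):
    """Blend the EMA weights toward the current weights after each optimizer step."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Illustrative training loop (the model, data, and loss are placeholders).
model = torch.nn.Linear(10, 1)
ema_model = copy.deepcopy(model)          # the averaged model used for evaluation
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, y in [(torch.randn(8, 10), torch.randn(8, 1))]:   # stand-in for a real data loader
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    update_model_ema(model, ema_model)    # train `model`, evaluate/deploy `ema_model`
```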
If you're at #NeurIPS2023, @KwangjunA will be presenting his work on SpecTr++ at the Optimal Transport workshop, where he discusses improved transport plans for speculative decoding.
[Today 5pm poster 401 #NeurIPS2023] Is your LLM inference too slow? We achieve a 2.13x wall-clock speedup when sampling from SOTA LLMs with 𝐩𝐫𝐨𝐯𝐚𝐛𝐥𝐲 no quality sacrifice. How? We use a cheap LM to draft 𝐦𝐮𝐥𝐭𝐢𝐩𝐥𝐞 samples, which are then scored with the LLM to accept/reject tokens.
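For anyone new to the draft-and-verify idea behind this line of work, below is the basic single-draft accept/reject rule for speculative sampling (not the SpecTr multi-draft transport plan itself); here p and q are the target-LLM and draft-LM next-token distributions, and the function name is illustrative.

```python
import numpy as np

def verify_draft_token(token: int, p: np.ndarray, q: np.ndarray, rng: np.random.Generator):
    """Basic speculative-sampling acceptance rule (single draft, for illustration only).

    p: target-LLM next-token distribution; q: draft-LM distribution the token was sampled from.
    Accepting with prob min(1, p/q) and resampling from the residual keeps samples
    distributed exactly according to p, hence no quality loss."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token                                  # keep the cheap draft token
    residual = np.maximum(p - q, 0.0)                 # otherwise resample from the leftover mass
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)
```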
Check out Xiang Cheng's talk on our linear transformer work, given at the Simons Institute!! youtube.com/live/PnwC74s1n…