Seunghyun Seo
@SeunghyunSEO7
deep learning enjoyer. from speech to llm, now image.
btw, i wrote a post about "how to scale" based on what i've learned over the past few months. it covers muP, HP scaling laws, and some other stuff. would be happy to get any feedback or discussion. (it's pretty verbose and there's no TL;DR, sorry lol) howtoscalenn.github.io

True. It's logit hard-capping with the capping factors absorbed into the weights.
correction: actually there is a clamp_max on η. (equivalently, rescaling only happens if max(qk) > t)
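(a minimal sketch of how i read that correction, not the exact MuonClip code; `qk_clip_factor` and the threshold `t` are just illustrative names:)

```python
import torch

def qk_clip_factor(attn_logits: torch.Tensor, t: float = 100.0) -> torch.Tensor:
    """Hypothetical sketch of the per-head rescaling factor described above.

    attn_logits: [batch, heads, q_len, k_len] pre-softmax q.k scores.
    eta = clamp_max(t / max(qk), 1.0), i.e. eta < 1 (rescaling kicks in)
    only when max(qk) > t; otherwise it's a no-op.
    """
    max_logit = attn_logits.amax(dim=(0, 2, 3)).clamp_min(1e-6)  # per-head max logit
    eta = (t / max_logit).clamp_max(1.0)
    return eta

# the factors would then be absorbed into the q/k projection weights per head
# (e.g. W_q *= eta**0.5, W_k *= eta**0.5), not applied at runtime.
```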
ReLU MLP with width / depth going to infinity. Note how different parameterizations lead to pathological scaling behavior (yellow / blue on the activations / weight gradients). muP solves this.
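(for reference, a rough sketch of the usual muP recipe for an MLP trained with Adam, as i remember it; check the muP paper / `mup` package for the authoritative table:)

```python
import math
import torch
import torch.nn as nn

# Rough sketch of muP-style scaling for a ReLU MLP with Adam. The point is just
# that init std and per-layer LR must shrink with width so activations and
# weight updates stay O(1) as width -> infinity (the "pathological" curves above
# are what you get with the standard parameterization).

def build_mup_mlp(d_in: int, width: int, d_out: int, base_lr: float = 1e-3):
    inp = nn.Linear(d_in, width)
    hidden = nn.Linear(width, width)
    readout = nn.Linear(width, d_out)

    # init std ~ 1/sqrt(fan_in) for input/hidden; readout starts at zero
    nn.init.normal_(inp.weight, std=1.0 / math.sqrt(inp.in_features))
    nn.init.normal_(hidden.weight, std=1.0 / math.sqrt(hidden.in_features))
    nn.init.zeros_(readout.weight)  # (a 1/width output multiplier is often used too)

    model = nn.Sequential(inp, nn.ReLU(), hidden, nn.ReLU(), readout)
    param_groups = [
        {"params": inp.parameters(), "lr": base_lr},              # input layer: O(1) LR
        {"params": hidden.parameters(), "lr": base_lr / width},   # hidden: LR ~ 1/width
        {"params": readout.parameters(), "lr": base_lr / width},  # readout: LR ~ 1/width
    ]
    return model, torch.optim.Adam(param_groups)
```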
chat, i just realized the parallel layer is not compatible with per-layer optimization stuff like muP. how should i apply 1/(4*dim) for mlp2 and 1/dim for o_proj? i guess it's also not great for muon, since it doesn't like rectangular matrices IIUC.
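(a toy illustration of the fused parallel-block down-projection i mean, with made-up shapes:)

```python
import torch
import torch.nn as nn

# In a GPT-J / PaLM style parallel block, the attn output (dim) and the mlp
# hidden (4*dim) can be concatenated and sent through ONE fused down-projection,
# so the per-matrix multipliers (1/dim for o_proj, 1/(4*dim) for mlp2) no longer
# have separate weights to attach to. Names/shapes here are just illustrative.

dim = 1024
fused_down = nn.Linear(dim + 4 * dim, dim, bias=False)   # one [5*dim -> dim] matrix

attn_out = torch.randn(2, 16, dim)        # would normally feed o_proj (fan_in = dim)
mlp_hidden = torch.randn(2, 16, 4 * dim)  # would normally feed mlp2 (fan_in = 4*dim)

y = fused_down(torch.cat([attn_out, mlp_hidden], dim=-1))

# the fused weight is also one wide rectangular matrix, which is the muon worry:
# its orthogonalization sees a single [dim, 5*dim] matrix, not two differently
# shaped ones.
```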

scaling law is all you need. only granularity and expert sharing matter.
Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models Impressive results. This resolves various problems around MoE all at once. First it reconfirms that MoE has a higher ratio of optimal data size relative to computational cost compared to…
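(just to make the two knobs concrete, a toy back-of-envelope calc with made-up numbers, not the paper's actual setup:)

```python
# "granularity": how finely a dense FFN of width d_ff is sliced into experts;
# "expert sharing": a few always-active shared experts alongside the routed ones.

d_model, d_ff = 4096, 16384
granularity = 8                       # each expert is d_ff / granularity wide
d_expert = d_ff // granularity

n_routed, top_k, n_shared = 64, 8, 1

ffn_params = lambda d_in, d_hidden: 2 * d_in * d_hidden   # up + down proj (no gate, for simplicity)

total_ffn = (n_routed + n_shared) * ffn_params(d_model, d_expert)
active_ffn = (top_k + n_shared) * ffn_params(d_model, d_expert)

print(f"total FFN params per layer:  {total_ffn / 1e6:.1f}M")
print(f"active FFN params per layer: {active_ffn / 1e6:.1f}M")
```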
Now that PyTorch is following up, I wonder when Nvidia Megatron-LM will do the same. Moonshot had a PR almost half a year ago: github.com/NVIDIA/Megatro… @NVIDIAAI
considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... github.com/pytorch/pytorc…
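(for context, a minimal sketch of Muon's core step, orthogonalizing the momentum with a Newton-Schulz iteration, roughly following the public speedrun-style implementations; details may differ from whatever lands in PyTorch core:)

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1).

    Quintic Newton-Schulz iteration roughly as used in public Muon
    implementations; the coefficients below are the commonly cited ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)              # get spectral norm roughly <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

# Muon then applies this to the (nesterov) momentum buffer of each 2-D weight,
# scaled by the LR; embeddings / 1-D params typically stay on AdamW.
```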
H-Nets are the future.
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
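(a toy sketch of how i read that description, definitely not the actual H-Net code: a boundary score from the similarity of adjacent hidden states, plus an EMA to keep the signal smooth:)

```python
import torch
import torch.nn.functional as F

# (1) similarity-score routing: adjacent hidden states that are dissimilar are
#     treated as likely chunk boundaries;
# (2) EMA-based smoothing of the boundary probabilities so the discrete
#     chunking decision stays stable / trainable.

def boundary_probs(h: torch.Tensor) -> torch.Tensor:
    # h: [batch, seq, dim]; boundary prob between positions t-1 and t
    sim = F.cosine_similarity(h[:, :-1], h[:, 1:], dim=-1)   # [batch, seq-1]
    return (1.0 - sim) / 2.0                                  # in [0, 1], high when dissimilar

def ema_smooth(p: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    out = torch.zeros_like(p)
    running = p[:, 0]
    out[:, 0] = running
    for t in range(1, p.size(1)):
        running = decay * running + (1 - decay) * p[:, t]
        out[:, t] = running
    return out
```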
Actually, Single-Scale RMSNorm was also introduced in the speedrun... While the Fern record's changes (see x.com/hi_tysam/statu…) adding scalars to (parameter-free) RMSNorm were not merged, we still have SSNorm in the speedrun: (v = norm(v), v = lambdas[0] * v)
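(a minimal sketch of that SSNorm pattern, parameter-free RMSNorm followed by one learned scalar:)

```python
import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    """Sketch of the (v = norm(v); v = lambdas[0] * v) pattern quoted above:
    a parameter-free RMSNorm followed by ONE learned scalar, instead of the
    usual per-channel gain vector."""
    def __init__(self, init_scale: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.eps = eps

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        v = v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * v
```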
i'm also keeping an eye on this work. do you guys remember that 'privileged bases' can arise when training NNs with adam(w)? the authors use muon and a simplified rmsnorm (with adamw on the embedding layer) to fix this issue, and it seems to work. they demonstrate it at 1.4B-model / 1T-token scale!
but i'm still not sure it works to scale only the total param count. to the best of my knowledge, ffn is for knowledge and attn is for reasoning, and kimi k2's activated param count is lower than dsv3's and its attn head count is also smaller. arxiv.org/abs/2410.19034
Shaowei from our infra team actually wrote about the decisions we made on the Kimi K2 architecture. zhihu.com/question/19271… I suggest reading it with Kimi K2 as your translator. :)
Btw the attn logit explosion issue was actually raised in the moonlight paper (moonshot's earlier large-scale muon work). i love how these two chinese groups, deepseek and moonshot, work. they just plan and slay issues step by step. arxiv.org/abs/2502.16982
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
I think this is the biggest reason Korea can't build a frontier model. Even at one major domestic conglomerate, people only leave and no one comes in. Even under the current government's strategy of investing 100 trillion won in AI, they seem interested only in buying large GPU clusters, not in recruiting talent, i.e. building an ecosystem.
The reason this number of 56 feels like such a big loss for Seoul National University is that SNU, conversely, never recruits professors away from top overseas universities. (11/N)
I intended to write a blog about Muon's infra scalability, but basically it was what @SeunghyunSEO7 mentioned: it is caused by the ZeRO-1 impl difference, and dim-0 sharding is not scalable for Moonshot's impl. So I just ended up writing some fun thoughts regarding Muon's…
we've discussed this for a while. legacy megatron / fsdp1 flattens all weights first and then slices, while fsdp2 uses per-param sharding, which is useful for qlora-like things. but moonshot's internal codebase is based on old megatron i guess, so it doesn't match fsdp2 well
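(a rough conceptual sketch of the two layouts with plain tensors, no real megatron / fsdp APIs:)

```python
import torch

params = [torch.randn(1024, 512), torch.randn(4096, 1024)]  # toy per-layer weights
world_size, rank = 8, 3

# (1) flat-param style (legacy megatron / fsdp1): flatten everything, then slice --
#     a rank's shard can start/end in the middle of any weight.
flat = torch.cat([p.reshape(-1) for p in params])
shard_len = flat.numel() // world_size
flat_shard = flat[rank * shard_len:(rank + 1) * shard_len]

# (2) per-parameter dim-0 sharding (fsdp2 / DTensor style): each weight is split
#     along dim 0, so every rank holds a contiguous row-block of every weight.
per_param_shards = [p.chunk(world_size, dim=0)[rank] for p in params]

# muon wants the full 2-D matrix to orthogonalize, which is presumably why the
# zero-1 / sharding layout choice matters so much here.
```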