Seunghyun Seo
@SeunghyunSEO7
deep learning enjoyer. from speech to llm, now image.
btw, i wrote a post about "how to scale" based on what i've learned over the past few months. it covers muP, HP scaling laws, and some other stuff. would be happy to get any feedback or discussion. (it's pretty verbose and there's no TL;DR, sorry lol) howtoscalenn.github.io

True. It's logit hard-capping with the capping factors absorbed into the weights.
correction: actually there is a clamp_max on η. (equivalently, rescaling only happens if max(qk) > t)
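(a minimal sketch of how i read that correction, not the exact MuonClip code; `qk_clip_factor` and the threshold `t` are just illustrative names:)

```python
import torch

def qk_clip_factor(attn_logits: torch.Tensor, t: float = 100.0) -> torch.Tensor:
    """Hypothetical sketch of the per-head rescaling factor described above.

    attn_logits: [batch, heads, q_len, k_len] pre-softmax q.k scores.
    eta = clamp_max(t / max(qk), 1.0), i.e. eta < 1 (rescaling kicks in)
    only when max(qk) > t; otherwise it's a no-op.
    """
    max_logit = attn_logits.amax(dim=(0, 2, 3)).clamp_min(1e-6)  # per-head max logit
    eta = (t / max_logit).clamp_max(1.0)
    return eta

# the factors would then be absorbed into the q/k projection weights per head
# (e.g. W_q *= eta**0.5, W_k *= eta**0.5), not applied at runtime.
```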
ReLU MLP with width / depth going to infinity. Note how different parameterizations lead to pathological scaling behavior (yellow / blue on the activations / weight gradients). muP solves this.
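(for reference, a rough sketch of the usual muP recipe for an MLP trained with Adam, as i remember it; check the muP paper / `mup` package for the authoritative table:)

```python
import math
import torch
import torch.nn as nn

# Rough sketch of muP-style scaling for a ReLU MLP with Adam. The point is just
# that init std and per-layer LR must shrink with width so activations and
# weight updates stay O(1) as width -> infinity (the "pathological" curves above
# are what you get with the standard parameterization).

def build_mup_mlp(d_in: int, width: int, d_out: int, base_lr: float = 1e-3):
    inp = nn.Linear(d_in, width)
    hidden = nn.Linear(width, width)
    readout = nn.Linear(width, d_out)

    # init std ~ 1/sqrt(fan_in) for input/hidden; readout starts at zero
    nn.init.normal_(inp.weight, std=1.0 / math.sqrt(inp.in_features))
    nn.init.normal_(hidden.weight, std=1.0 / math.sqrt(hidden.in_features))
    nn.init.zeros_(readout.weight)  # (a 1/width output multiplier is often used too)

    model = nn.Sequential(inp, nn.ReLU(), hidden, nn.ReLU(), readout)
    param_groups = [
        {"params": inp.parameters(), "lr": base_lr},              # input layer: O(1) LR
        {"params": hidden.parameters(), "lr": base_lr / width},   # hidden: LR ~ 1/width
        {"params": readout.parameters(), "lr": base_lr / width},  # readout: LR ~ 1/width
    ]
    return model, torch.optim.Adam(param_groups)
```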
chat, i just realized the parallel layer is not compatible with per-layer optimization stuff like muP. how should i apply 1/(4*dim) for mlp2 and 1/dim for o_proj? i guess it's also not great for muon, since it doesn't like rectangular matrices IIUC.
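(a toy illustration of the fused parallel-block down-projection i mean, with made-up shapes:)

```python
import torch
import torch.nn as nn

# In a GPT-J / PaLM style parallel block, the attn output (dim) and the mlp
# hidden (4*dim) can be concatenated and sent through ONE fused down-projection,
# so the per-matrix multipliers (1/dim for o_proj, 1/(4*dim) for mlp2) no longer
# have separate weights to attach to. Names/shapes here are just illustrative.

dim = 1024
fused_down = nn.Linear(dim + 4 * dim, dim, bias=False)   # one [5*dim -> dim] matrix

attn_out = torch.randn(2, 16, dim)        # would normally feed o_proj (fan_in = dim)
mlp_hidden = torch.randn(2, 16, 4 * dim)  # would normally feed mlp2 (fan_in = 4*dim)

y = fused_down(torch.cat([attn_out, mlp_hidden], dim=-1))

# the fused weight is also one wide rectangular matrix, which is the muon worry:
# its orthogonalization sees a single [dim, 5*dim] matrix, not two differently
# shaped ones.
```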

scaling law is all you need. only granularity and expert sharing matter.
Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models Impressive results. This resolves various problems around MoE all at once. First it reconfirms that MoE has a higher ratio of optimal data size relative to computational cost compared to…
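(just to make the two knobs concrete, a toy back-of-envelope calc with made-up numbers, not the paper's actual setup:)

```python
# "granularity": how finely a dense FFN of width d_ff is sliced into experts;
# "expert sharing": a few always-active shared experts alongside the routed ones.

d_model, d_ff = 4096, 16384
granularity = 8                       # each expert is d_ff / granularity wide
d_expert = d_ff // granularity

n_routed, top_k, n_shared = 64, 8, 1

ffn_params = lambda d_in, d_hidden: 2 * d_in * d_hidden   # up + down proj (no gate, for simplicity)

total_ffn = (n_routed + n_shared) * ffn_params(d_model, d_expert)
active_ffn = (top_k + n_shared) * ffn_params(d_model, d_expert)

print(f"total FFN params per layer:  {total_ffn / 1e6:.1f}M")
print(f"active FFN params per layer: {active_ffn / 1e6:.1f}M")
```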
Now that PyTorch is following up, I wonder when Nvidia Megatron-LM will do the same. Moonshot had a PR almost half a year ago: github.com/NVIDIA/Megatro… @NVIDIAAI
considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... github.com/pytorch/pytorc…
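(for context, a minimal sketch of Muon's core step, orthogonalizing the momentum with a Newton-Schulz iteration, roughly following the public speedrun-style implementations; details may differ from whatever lands in PyTorch core:)

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1).

    Quintic Newton-Schulz iteration roughly as used in public Muon
    implementations; the coefficients below are the commonly cited ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)              # get spectral norm roughly <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

# Muon then applies this to the (nesterov) momentum buffer of each 2-D weight,
# scaled by the LR; embeddings / 1-D params typically stay on AdamW.
```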
H-Nets are the future.
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
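(a toy sketch of how i read that description, definitely not the actual H-Net code: a boundary score from the similarity of adjacent hidden states, plus an EMA to keep the signal smooth:)

```python
import torch
import torch.nn.functional as F

# (1) similarity-score routing: adjacent hidden states that are dissimilar are
#     treated as likely chunk boundaries;
# (2) EMA-based smoothing of the boundary probabilities so the discrete
#     chunking decision stays stable / trainable.

def boundary_probs(h: torch.Tensor) -> torch.Tensor:
    # h: [batch, seq, dim]; boundary prob between positions t-1 and t
    sim = F.cosine_similarity(h[:, :-1], h[:, 1:], dim=-1)   # [batch, seq-1]
    return (1.0 - sim) / 2.0                                  # in [0, 1], high when dissimilar

def ema_smooth(p: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    out = torch.zeros_like(p)
    running = p[:, 0]
    out[:, 0] = running
    for t in range(1, p.size(1)):
        running = decay * running + (1 - decay) * p[:, t]
        out[:, t] = running
    return out
```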
Actually, Single-Scale RMSNorm was also introduced in the speedrun... While the Fern record's changes (see x.com/hi_tysam/statu…) adding scalars to (parameter-free) RMSNorm were not merged, we still have SSNorm in the speedrun: (v = norm(v), v = lambdas[0] * v)
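(a minimal sketch of that SSNorm pattern, parameter-free RMSNorm followed by one learned scalar:)

```python
import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    """Sketch of the (v = norm(v); v = lambdas[0] * v) pattern quoted above:
    a parameter-free RMSNorm followed by ONE learned scalar, instead of the
    usual per-channel gain vector."""
    def __init__(self, init_scale: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.eps = eps

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        v = v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * v
```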
i'm also keeping an eye on this work. do you guys remember that 'privileged bases' can arise when training NNs with adam(w)? the authors use muon and a simplified rmsnorm (with adamw on the embedding layer) to fix this issue, and it seems to work. they demonstrate it at 1.4B-model / 1T-token scale!
but i'm still not sure it works to scale only the total param count. to the best of my knowledge, ffn is for knowledge and attn is for reasoning, and kimi k2's activated param count is lower than dsv3's and its attn head count is also smaller. arxiv.org/abs/2410.19034
Shaowei from our infra team actually wrote about the decisions we made on the Kimi K2 architecture. zhihu.com/question/19271… I suggest reading it with Kimi K2 as your translator. :)
Btw the attn logit explosion issue was actually raised in the moonlight paper (moonshot's earlier large-scale muon work). i love how these two chinese groups, deepseek and moonshot, work. they just plan and slay issues step by step. arxiv.org/abs/2502.16982
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
I think this is the biggest reason Korea can't build a frontier model. Even at one major domestic conglomerate, people only leave and no one comes in. Even under the current government's strategy of investing 100 trillion won in AI, they seem interested only in buying large GPU clusters, not in recruiting talent, i.e. building an ecosystem.
The reason this number of 56 feels like such a big loss for Seoul National University is that SNU, conversely, never recruits professors away from top overseas universities. (11/N)
I intended to write a blog about Muon's infra scalability, but basically it was what @SeunghyunSEO7 mentioned: it is caused by the ZeRO-1 impl difference, and dim-0 sharding is not scalable for Moonshot's impl. So I just ended up writing some fun thoughts regarding Muon's…
we've discussed this for a while. legacy megatron / fsdp1 flattens all weights first and then slices, while fsdp2 uses per-param sharding, which is useful for qlora-like things. but moonshot's internal codebase is based on old megatron i guess, so it doesn't match fsdp2 well
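(a rough conceptual sketch of the two layouts with plain tensors, no real megatron / fsdp APIs:)

```python
import torch

params = [torch.randn(1024, 512), torch.randn(4096, 1024)]  # toy per-layer weights
world_size, rank = 8, 3

# (1) flat-param style (legacy megatron / fsdp1): flatten everything, then slice --
#     a rank's shard can start/end in the middle of any weight.
flat = torch.cat([p.reshape(-1) for p in params])
shard_len = flat.numel() // world_size
flat_shard = flat[rank * shard_len:(rank + 1) * shard_len]

# (2) per-parameter dim-0 sharding (fsdp2 / DTensor style): each weight is split
#     along dim 0, so every rank holds a contiguous row-block of every weight.
per_param_shards = [p.chunk(world_size, dim=0)[rank] for p in params]

# muon wants the full 2-D matrix to orthogonalize, which is presumably why the
# zero-1 / sharding layout choice matters so much here.
```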