Jeremy Bernstein
@jxbz
🧪 @thinkymachines ✍️ anon feedback @ http://admonymous.co/jxbz
I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
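For readers who just want the gist: the update the post derives is (roughly) momentum SGD where each 2D update matrix is replaced by a semi-orthogonal approximation, computed with a Newton-Schulz iteration. A minimal sketch, using the quintic coefficients from Keller Jordan's public Muon implementation (this is my paraphrase, not code from the post):

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (map it toward U V^T from its SVD)
    # with a quintic Newton-Schulz iteration. Coefficients are from
    # Keller Jordan's public Muon implementation.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X = X / (X.norm() + eps)  # ensure spectral norm <= 1
    if G.size(0) > G.size(1):
        X = X.T               # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # One sketch of a Muon step for a 2D weight W: accumulate momentum,
    # then apply the orthogonalized momentum matrix as the update.
    # (The real implementation adds details like Nesterov momentum
    # and shape-dependent learning-rate scaling.)
    buf.mul_(momentum).add_(grad)
    W.add_(newton_schulz5(buf), alpha=-lr)
```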

👀
Apparently Dion is now being worked on for Torch Titan: github.com/pytorch/torcht… :-)
Considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... github.com/pytorch/pytorc…
Still a relative newbie, but I am very excited about this team and what we are building
Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're…
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence…
Holy shit. Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spike. Muon has officially scaled to the 1-trillion-parameter LLM level. Many doubted it could scale, but here we are. So proud of the Muon team: @kellerjordan0, @bozavlado, @YouJiacheng,…
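For context, MuonClip (per the Kimi K2 report) pairs Muon with "QK-Clip", which tames exploding attention logits by rescaling a head's query/key projections whenever its max logit exceeds a threshold. A minimal sketch of that rescaling rule as I understand it; the threshold value, names, and the assumption that max_logit is tracked per head during the forward pass are all illustrative, not Kimi's code:

```python
import torch

@torch.no_grad()
def qk_clip(W_q, W_k, max_logit, tau=100.0):
    # If this head's largest pre-softmax attention logit exceeded tau,
    # shrink W_q and W_k by sqrt(gamma) each; logits are bilinear in
    # (W_q, W_k), so they scale down by gamma = tau / max_logit.
    if max_logit > tau:
        gamma = tau / max_logit
        W_q.mul_(gamma ** 0.5)
        W_k.mul_(gamma ** 0.5)
```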
midjourney introduces video generation and it’s surpassing all my expectations.
Announcing 𝐟𝐥𝐚𝐬𝐡-𝐦𝐮𝐨𝐧: a 🐍 pkg with customized CUDA kernels that aim to speed up the Muon optimizer: github.com/nil0x9/flash-m… 1/n
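I haven't studied the kernels, but the hot spot is visible from the math: each Newton-Schulz step is a handful of large matmuls, and X @ X.T has symmetric output that a plain GEMM recomputes in full. A rough way to measure the baseline cost in stock PyTorch (matrix size and loop count are illustrative; requires a CUDA device; this does not use flash-muon's actual API):

```python
import time
import torch

# Time the matmuls inside one quintic Newton-Schulz step, the work a
# fused/custom kernel would target.
X = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    A = X @ X.T                          # symmetric output
    B = -4.7750 * A + 2.0315 * (A @ A)
    Y = 3.4445 * X + B @ X
torch.cuda.synchronize()
print(f"{(time.time() - t0) / 100 * 1e3:.2f} ms per NS step")
```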
Pretty wild to see work that I contributed to (e.g., AlgoPerf, Crowded Valley @robinschmidt_) included in a university course. I feel very honored.
Lecture 11: benchmarking optimizers
1. the problem: comparing optimizers (sgd, adam, etc.) in deep learning is tricky.
2. challenge 1: defining "speed". curves cross, so use time-to-result (sketch below).
3. challenge 2: hyperparameter tuning trap. protocol matters more than algo? (choi et…
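The "time-to-result" idea from point 2 is simple to operationalize: fix a target metric value in advance and report how long each optimizer takes to reach it, instead of comparing loss at an arbitrary fixed step (where curves can cross). A toy sketch; the function names and the step-count budget are illustrative:

```python
def steps_to_target(train_step, eval_loss, target, max_steps=10_000):
    # Run training until validation loss first reaches `target`;
    # return the step count, or None if the budget runs out.
    # Ranking optimizers by this number sidesteps crossing curves.
    for step in range(1, max_steps + 1):
        train_step()
        if eval_loss() <= target:
            return step
    return None
```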