Zach Mueller
@TheZachMueller
Let's make billions of parameters go brr http://bit.ly/scratchtoscale
Distributed training has its own dialect. I made a pocket dictionary so you don’t open 50 browser tabs every time a paper mentions “ZeRO-offload.” 49 terms, crisp definitions, diagrams where they actually help. Grab it, skim it, get back to training. distributedlexicon(.)com

>activation “checkpointing”
>look inside
>saves _fewer_ activations than normal and does recomputation instead
>???
Activation checkpointing (also called gradient checkpointing) saves GPU memory by storing just a small set of “checkpoint” activations during the forward pass instead of keeping every intermediate tensor. When the backward pass needs those missing activations, it briefly re-runs…
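A toy sketch of the idea above, in plain Python rather than a real framework. Everything here is an illustrative assumption: the scalar `tanh` "layers", the function names, and the `segment` parameter are made up for the example and are not taken from any library. It compares a baseline that keeps every layer input against a version that keeps only every `segment`-th input and recomputes the rest during the backward pass.

```python
import math

def layer_forward(w, x):
    # Hypothetical one-parameter "layer": y = tanh(w * x)
    return math.tanh(w * x)

def layer_backward(w, x, grad_out):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)

def grad_full(weights, x):
    """Baseline: store the input of every layer, then backprop."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, inp, grad)
    return grad, len(inputs)

def grad_checkpointed(weights, x, segment=4):
    """Store only the input of every `segment`-th layer; recompute the rest."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # the only activations kept from the forward pass
        x = layer_forward(w, x)

    grad = 1.0
    peak_stored = len(checkpoints)
    # Walk the segments from last to first, re-running each one's forward pass.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        inputs = []
        h = checkpoints[start]
        for w in weights[start:end]:
            inputs.append(h)
            h = layer_forward(w, h)
        # Upper bound on live values: all checkpoints plus one recomputed segment.
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(weights[start:end]), reversed(inputs)):
            grad = layer_backward(w, inp, grad)
    return grad, peak_stored

weights = [0.5 + 0.1 * i for i in range(16)]
g_full, n_full = grad_full(weights, 0.3)
g_ckpt, n_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(math.isclose(g_full, g_ckpt), n_full, n_ckpt)
```

With 16 layers and `segment=4`, the baseline holds 16 stored inputs while the checkpointed version holds at most 8 at any time, and both produce the same gradient. The price is one extra forward pass per segment, which is the memory-for-compute trade the post describes.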
Pre-game meal before O’s tonight with my father (topic of discussion ORPO)

On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, chaos should be expected. The fastest model is great but time to market matters more.
Expert parallelism actually just tensor parallelism on the batch dimension
The pocket dictionary stopped the Wikipedia spiral. The course ends the guessing: • build DDP from scratch • shard models with ZeRO • pipeline & tensor-parallel on (free) Modal GPUs 14 guest talks from Hugging Face, Meta, Snowflake and more. Starts Sept 1.
Chinese open models being at the frontier today isn’t thaaat surprising for folks who were following closely. What is more surprising is how quickly Llama disappeared from the clear top spot in the conversation. Two trends combine for massive impact.
Glad so many of you are enjoying the Lexicon! I'll update it every so often with new terms and definitions. Esp as it's the top of google rankings 👀🤩

When you realize that open-source is at the frontier of AI despite: - less GPUs - less money - less public and policy support - no $100M salaries to attract talent - with closed-source taking advantage and copying all the innovations of open-source without contributing back…
In college I was part of a team doing autonomous drone racing. One of the challenges was to identify the corners. Everyone else was trying to use RCNN & complicated solutions. I had the idea of just putting bounding boxes on each corner of a gate and take the middle. We got 3rd
share a piece of software dev lore about yourself
“we still have a lot of gifts for u guys!” We have o4-mini at home, and there’s *more*?? 🤩🤩
the thinking model finally comes at the end of this week. okay we ve made some improvements on qwen3 to make this thinking 2507. better perf on a number of tasks and more generalized perf. hope it can solve some problems that u met with qwen3-235b-a22b. but i have to say, there…
🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…
The second half of your life begins when you realize you can go to the movies alone.