Zach Mueller

@TheZachMueller

Let's make billions of parameters go brr http://bit.ly/scratchtoscale

In a SLURM cluster

Joined April 2016

532 Following

12K Followers

Pinned

Zach Mueller@TheZachMueller · Jul 24

Distributed training has its own dialect. I made a pocket dictionary so you don’t open 50 browser tabs every time a paper mentions “ZeRO-offload.” 49 terms, crisp definitions, diagrams where they actually help. Grab it, skim it, get back to training. distributedlexicon(.)com

575

695

50.0K

Zach Mueller@TheZachMueller · 16 h

>activation “checkpointing”
>look inside
>saves _fewer_ activations than normal and does recomputation instead
>???

Zach Mueller@TheZachMueller · Jul 25

Activation checkpointing (also called gradient checkpointing) saves GPU memory by storing just a small set of “checkpoint” activations during the forward pass instead of keeping every intermediate tensor. When the backward pass needs those missing activations, it briefly re-runs…

4.0K
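
The trade-off described in the post above can be sketched in plain Python. This is a toy illustration only, assuming a chain of scalar `tanh` layers; the names `backprop_full`, `backprop_checkpointed`, and the `segment` parameter are hypothetical and not taken from any library. It keeps one activation per segment during the forward pass and recomputes the rest of each segment when the backward pass reaches it.

```python
import math


def layer_forward(w, x):
    # One toy "layer": y = tanh(w * x)
    return math.tanh(w * x)


def layer_backward(w, x, grad_out):
    # Gradient of y = tanh(w * x) with respect to x and w
    y = math.tanh(w * x)
    dz = grad_out * (1.0 - y * y)
    return dz * w, dz * x


def backprop_full(weights, x0):
    # Baseline: keep every activation from the forward pass
    acts = [x0]
    for w in weights:
        acts.append(layer_forward(w, acts[-1]))
    grad = 1.0
    grad_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grad_w[i] = layer_backward(weights[i], acts[i], grad)
    return acts[-1], grad, grad_w, len(acts)


def backprop_checkpointed(weights, x0, segment):
    # Forward: store only the input of every `segment`-th layer
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer_forward(w, x)
    out = x

    grad = 1.0
    grad_w = [0.0] * n
    peak = len(checkpoints)
    # Backward: walk the segments last to first
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        # Recompute the layer inputs inside this segment from its checkpoint
        local = [checkpoints[start]]
        for i in range(start, end - 1):
            local.append(layer_forward(weights[i], local[-1]))
        # Rough count of values held at once (checkpoints + one segment)
        peak = max(peak, len(checkpoints) + len(local))
        for i in reversed(range(start, end)):
            grad, grad_w[i] = layer_backward(weights[i], local[i - start], grad)
    return out, grad, grad_w, peak


weights = [0.5 + 0.1 * i for i in range(16)]
full = backprop_full(weights, 0.3)
ckpt = backprop_checkpointed(weights, 0.3, segment=4)
print("stored values, full vs checkpointed:", full[3], ckpt[3])
```

Both functions return the same output and gradients; the last element of each result is a rough count of activations held at once, which is smaller in the checkpointed version at the cost of a second forward pass through each segment.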

Zach Mueller Retweeted

tender@tenderizzation · 5 h

that you will definitely not procrastinate learning about

264

9.0K

Zach Mueller@TheZachMueller · 14 h

Game day time 🐥

409

Zach Mueller@TheZachMueller · 15 h

Pre-game meal before O’s tonight with my father (topic of discussion ORPO)

563

Zach Mueller Retweeted

Mark Saroufim@marksaroufim · 19 h

On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, chaos should be expected. The fastest model is great but time to market matters more.

9.0K

Zach Mueller Retweeted

difficultyang@difficultyang · 21 h

Expert parallelism actually just tensor parallelism on the batch dimension

7.0K

Zach Mueller Retweeted

Zach Mueller@TheZachMueller · Jul 24

The pocket dictionary stopped the Wikipedia spiral. The course ends the guessing:
• build DDP from scratch
• shard models with ZeRO
• pipeline & tensor-parallel on (free) Modal GPUs
14 guest talks from Hugging Face, Meta, Snowflake and more. Starts Sept 1.

12.0K

Zach Mueller Retweeted

Nathan Lambert@natolambert · 21 h

Chinese open models being at the frontier today isn’t thaaat surprising for folks who were following closely. What is more surprising is how quickly Llama disappeared from the clear top spot in the conversation. Two trends combine for massive impact.

461

28.0K

Zach Mueller@TheZachMueller · 22 h

Glad so many of you are enjoying the Lexicon! I'll update it every so often with new terms and definitions. Esp as it's the top of google rankings 👀🤩

424

Zach Mueller Retweeted

clem 🤗@ClementDelangue · 22 h

When you realize that open-source is at the frontier of AI despite:
- less GPUs
- less money
- less public and policy support
- no $100M salaries to attract talent
- with closed-source taking advantage and copying all the innovations of open-source without contributing back…

143

1.0K

189

84.0K

Zach Mueller@TheZachMueller · Jul 25

In college I was part of a team doing autonomous drone racing. One of the challenges was to identify the corners. Everyone else was trying to use RCNN & complicated solutions. I had the idea of just putting bounding boxes on each corner of a gate and taking the middle. We got 3rd

Rhys@RhysSullivan · Jul 23

share a piece of software dev lore about yourself

968

Zach Mueller@TheZachMueller · Jul 25

“we still have a lot of gifts for u guys!” We have o4-mini at home, and there’s *more*?? 🤩🤩

Junyang Lin@JustinLin610 · Jul 25

the thinking model finally comes at the end of this week. okay we've made some improvements on qwen3 to make this thinking 2507. better perf on a number of tasks and more generalized perf. hope it can solve some problems that u met with qwen3-235b-a22b. but i have to say, there…

591

Zach Mueller Retweeted

Qwen@Alibaba_Qwen · Jul 25

🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…

173

567

4.0K

830

657.0K

Zach Mueller Retweeted

Alamo Drafthouse@alamodrafthouse · Jul 24

The second half of your life begins when you realize you can go to the movies alone.

439

4.0K

157.0K