Daniel Vega-Myhre
@vega_myhre
ML SWE working on PyTorch
Just wrote an illustrated deep-dive into overlapping compute and comms in TP+SP using Async TP. My eyeballs hurt now, so hopefully somebody finds it useful :) danielvegamyhre.github.io/ml/performance…
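To give a rough feel for the idea: the all-gather feeding a TP matmul can be decomposed so that communication for one shard overlaps with compute on shards that have already arrived. Below is a minimal sketch in plain torch.distributed; the function name and the per-shard broadcast scheme are illustrative only, not the actual PyTorch Async TP implementation (which does this fusion under torch.compile).

```python
# Conceptual sketch of micro-pipelining an all-gather + matmul (the core idea
# behind Async TP). Illustrative only, not the PyTorch implementation.
import torch
import torch.distributed as dist

def overlapped_all_gather_matmul(a_shard: torch.Tensor, b: torch.Tensor, group=None) -> torch.Tensor:
    """Compute cat(all_gather(a_shard), dim=0) @ b, overlapping comm with compute."""
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)

    # Launch one async broadcast per source rank so shards stream in
    # independently instead of arriving as a single blocking all-gather.
    bufs, works = [], []
    for src in range(world_size):
        buf = a_shard.contiguous() if src == rank else torch.empty_like(a_shard)
        src_global = dist.get_global_rank(group, src) if group is not None else src
        works.append(dist.broadcast(buf, src=src_global, group=group, async_op=True))
        bufs.append(buf)

    # Multiply each shard as soon as its communication finishes; comm for
    # later shards overlaps with matmuls on earlier ones.
    partials = []
    for src in range(world_size):
        works[src].wait()
        partials.append(bufs[src] @ b)

    return torch.cat(partials, dim=0)
```

The design point is simply that the big blocking collective gets split into smaller pieces whose latency can hide behind partial matmuls; the real implementation pattern-matches this inside the compiler rather than asking users to rewrite their models.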
On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, and chaos should be expected. The fastest model is great, but time to market matters more.
🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train Autoregressive models. ▶️ Data-constrained? Train Diffusion models. Get ready for 🤿 1/n
I’m at #ICML2025 today presenting a poster on our paper on TorchAO at the CodeML workshop - come say hey! Paper: openreview.net/forum?id=HpqH0…
When I was at Google, maintaining high training goodput in the face of infra failures was a big challenge we faced for massive distributed training runs. Cool to see progress on fault-tolerant training happening at the framework layer.
the model keeps on training even when the underlying infra keeps failing... out-of-the-box PyTorch
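The fault tolerance referenced above happens at the framework layer. As a point of contrast, the crudest application-level version is just frequent checkpointing plus resume-on-restart; a minimal sketch of that baseline is below (paths, model, and step counts are illustrative, and this is not the in-framework approach the posts above describe).

```python
# Crude application-level fault tolerance: checkpoint every N steps and resume
# from the latest checkpoint after a crash/restart. This is NOT the in-framework
# fault tolerance referenced above; it is the baseline that approach improves on.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # illustrative path
CKPT_EVERY = 100              # illustrative checkpoint frequency

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
start_step = 0

# Resume if a previous run left a checkpoint behind.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Persist training state so a restart loses at most CKPT_EVERY steps.
    if step % CKPT_EVERY == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
            CKPT_PATH,
        )
```

The goodput cost of this baseline (lost steps plus restart time on every failure) is exactly what framework-level fault tolerance is trying to shrink.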
On @CrusoeAI's new H200 cluster, tests demonstrated 34–43% #PyTorch training acceleration at scale by leveraging TorchTitan’s HSDP2 and TorchAO’s new #float8 rowwise. Along with substantial speedups, training showed convergence and stability comparable to BF16. 📖➡️…
We just demonstrated proof of stability at scale for PyTorch native float8 training with rowwise scales. Similar convergence to bfloat16 with a ~33% speedup! pytorch.org/blog/accelerat…
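For anyone wanting to try this, enabling float8 training with rowwise scaling via torchao looks roughly like the sketch below. Treat the recipe name and function signatures as assumptions about a recent torchao version and check the torchao docs for your install; float8 matmuls also need recent (SM89+) GPUs.

```python
# Rough sketch of enabling float8 training with rowwise scaling via torchao.
# Recipe/function names are assumptions based on recent torchao releases;
# verify against the version you have installed. Requires an SM89+ GPU.
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 4096, bias=False),
    nn.GELU(),
    nn.Linear(4096, 4096, bias=False),
).to("cuda", dtype=torch.bfloat16)

# Swap eligible nn.Linear modules for float8 linears using the rowwise-scaling recipe.
config = Float8LinearConfig.from_recipe_name("rowwise")
convert_to_float8_training(model, config=config)

# Training proceeds as usual: the linear matmuls run in float8 with rowwise
# scales, while weights and optimizer state stay in higher precision.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
```

The appeal of rowwise scaling is finer-grained scale factors than a single per-tensor scale, which is what makes the bfloat16-comparable convergence in the post possible while still getting the float8 matmul speedup.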
For any ML folks who want to deepen their understanding of ML scalability & performance techniques, I wrote an illustrated deep-dive into Megatron-style tensor parallelism: danielvegamyhre.github.io/ml/performance… Any feedback is welcome!
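As a companion to the post, here is a bare-bones sketch of the core Megatron-style trick for an MLP block: the first linear is column-parallel (weight sharded along the output dim, so the forward matmul needs no comm), the second is row-parallel (sharded along the input dim), and an all-reduce sums the partial outputs. Class names are my own, not Megatron's, and real implementations also handle the backward-pass comms.

```python
# Minimal sketch of Megatron-style tensor parallelism for an MLP block.
# Class names are illustrative; real implementations also handle backward-pass
# comms, init, and sequence parallelism. Assumes torch.distributed is already
# initialized (e.g. launched with torchrun).
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Holds a [out_features/world_size, in_features] shard of the weight.
    Each rank produces its own slice of the output features; no forward comm."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.weight = nn.Parameter(torch.randn(out_features // world_size, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.t()  # [..., out_features // world_size]

class RowParallelLinear(nn.Module):
    """Holds a [out_features, in_features/world_size] shard of the weight.
    Each rank computes a partial sum over its slice of input features,
    then an all-reduce combines the partial outputs."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert in_features % world_size == 0
        self.weight = nn.Parameter(torch.randn(out_features, in_features // world_size) * 0.02)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        partial = x_shard @ self.weight.t()  # partial sum on this rank
        dist.all_reduce(partial)             # sum partials across TP ranks
        return partial

class TensorParallelMLP(nn.Module):
    def __init__(self, hidden: int, ffn: int, world_size: int):
        super().__init__()
        self.up = ColumnParallelLinear(hidden, ffn, world_size)   # shards the GELU inputs
        self.down = RowParallelLinear(ffn, hidden, world_size)    # shards the GELU outputs
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is replicated across TP ranks; the output is replicated again
        # after the all-reduce inside the row-parallel linear.
        return self.down(self.act(self.up(x)))

# Usage (per rank, after dist.init_process_group, with x replicated):
#   mlp = TensorParallelMLP(hidden=1024, ffn=4096, world_size=dist.get_world_size())
#   y = mlp(torch.randn(8, 1024))
```

The key design choice is pairing the column-parallel and row-parallel splits so the intermediate activation never has to be gathered: only one all-reduce per MLP block is needed in the forward pass.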