Mark Saroufim
@marksaroufim
ml systems @pytorch unleashing @GPU_MODE
Just wrote an illustrated deep-dive into overlapping compute and comms in TP+SP using Async TP. My eyeballs hurt now so hopefully somebody finds it useful :) danielvegamyhre.github.io/ml/performance…
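The core idea behind the deep-dive above is pipelining: instead of waiting for a full all-gather before the matmul, fetch one shard at a time so communication for the next shard overlaps compute on the current one. The toy sketch below simulates this with threads and simulated latency; all names (`fetch_chunk`, `overlapped`, etc.) are illustrative assumptions, not the actual PyTorch Async TP APIs.

```python
# Toy illustration of the Async TP overlap idea: comm for shard i+1
# runs concurrently with the partial matmul on shard i.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(rank):
    """Stand-in for receiving one rank's shard over the interconnect."""
    time.sleep(0.01)  # simulated communication latency
    return [[rank + 1.0] * 2 for _ in range(2)]  # tiny 2x2 shard

def matmul_chunk(a, b):
    """Stand-in for the partial matmul on one gathered shard."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def overlapped(world_size, local_weight):
    """Pipeline: prefetch the next shard while computing on the current one."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        future = comm.submit(fetch_chunk, 0)  # prefetch the first shard
        for rank in range(world_size):
            shard = future.result()  # wait only for this shard
            if rank + 1 < world_size:
                future = comm.submit(fetch_chunk, rank + 1)  # overlap next comm
            out.append(matmul_chunk(shard, local_weight))    # compute now
    return out

if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # identity weight shard for the demo
    results = overlapped(world_size=4, local_weight=w)
    print(len(results))  # one partial result per gathered shard
```

In the real implementation this decomposition is done at the level of fused collective+matmul kernels; the sketch only shows why splitting the gather lets the comm latency hide behind compute.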
Thank you for the collaboration and the talk Mark. I think 🔥+PyTorch are a beautiful combination - a new flame for the torch! It was great to get to spend the day with you yesterday
At the Mojo hackathon today I went over how PyTorch is making it easier to spin up new backends for all the new languages and hardware we're seeing this year and a WIP backend we've been collaborating on with the Mojo team docs.google.com/presentation/d…
The biggest dataset of human-written GPU code, all open-source? 👀 YES Please! We at @GPU_MODE have released around 40k 🚀 human-written code samples spanning Triton, HIP and PyTorch, and it's all open on the @huggingface Hub. Train the new GPT to make GPTs faster ⚡️ Link below ⬇️
Make Flux go brrr on H100s without bells and whistles ⚡️ We're excited to share a simple recipe, dubbed `flux-fast`, that delivers a 2.5x speedup on H100 GPUs. 🔗 Blog: hubs.la/Q03tBKP70 ➡️ Code: hubs.la/Q03tBF8-0 By Joel Schlosser & @RisingSayak
ColorsWind, the team that won the AMD $100K grand prize, just OSS'd their code github.com/RadeonFlow/Rad…
anyone interested in building the best codegen LLM for GPU kernels with the @GPU_MODE team, completely open-source? lmk
We've run thousands of LLM inference serving benchmarks at @modal_labs. We're releasing the results so you don't have to. We're releasing the code so that you can. Introducing: The LLM Engineer's Almanac. Just in time for the @aiDotEngineer World's Fair.
Work w/ @cocosci_lab, @karthik_r_n, and @OfirPress Paper: arxiv.org/abs/2505.18134 Code: github.com/alexzhang13/vi… Website: vgbench.com Discord: discord.gg/W89VqYhQcy Our platform is completely open source and super easy to modify / plug into!
Will be live-coding the best performance tricks we saw from our kernel competition
Dive into GPU architecture & kernel dev with @MarkSaroufim at #AdvancingAI. Start with the basics, build a kernel, and join a live @GPU_MODE project to level up your skills. Register now → bit.ly/3SOgyf7
I've seen surprisingly few people complaining about 5090 PyTorch and Triton performance. Did y'all get past the scalpers? Should we make things go super fast?
At this point submitting your company reference kernels to the GPU MODE kernel leaderboard is the easiest way to accelerate them. All submissions will be made public at the end of every competition so please use the fast code and hire the cracked engineers who produced it
More cracked submissions to the @AMD x @GPU_MODE leaderboard! ✍️18k+ submissions since the beginning!! 🏆FP8 GEMM: A battle btwn Seb and Snektron for 🥇, with a 25% faster kernel since 2 weeks ago! 🤯 Single-device MoE: multiple ppl are now 100-600x faster than PyTorch ref!
🎉Factorio Learning Environment 0.2.0 released! 📖Details: jackhopkins.github.io/factorio-learn… New Features: - Multi-agent support - Reasoning models + MCP - Reflection & backtracking - Vision-augmented inputs and more frontier model results! The initial release of FLE was met with great…
Registration ends tomorrow for the grand prize, but you can still compete for glory until May 27. This is probably the hardest problem we've designed so far, so glhf!
📣 Problem 2, the fused Mixture-of-Experts kernel 🍿 for MI300s, is now OPEN for the @AMD x @GPU_MODE $100k competition! Go compete now for huge cash prizes -- registration ends SOON! Good luck everyone!