Mark Saroufim
@marksaroufim
ml systems @pytorch unleashing @GPU_MODE
Just wrote an illustrated deep-dive into overlapping compute and comms in TP+SP using Async TP. My eyeballs hurt now so hopefully somebody finds it useful :) danielvegamyhre.github.io/ml/performance…
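The core idea behind the deep-dive above is pipelining: instead of waiting for a full all-gather before the matmul, fetch one shard at a time so communication for the next shard overlaps compute on the current one. The toy sketch below simulates this with threads and simulated latency; all names (`fetch_chunk`, `overlapped`, etc.) are illustrative assumptions, not the actual PyTorch Async TP APIs.

```python
# Toy illustration of the Async TP overlap idea: comm for shard i+1
# runs concurrently with the partial matmul on shard i.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(rank):
    """Stand-in for receiving one rank's shard over the interconnect."""
    time.sleep(0.01)  # simulated communication latency
    return [[rank + 1.0] * 2 for _ in range(2)]  # tiny 2x2 shard

def matmul_chunk(a, b):
    """Stand-in for the partial matmul on one gathered shard."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def overlapped(world_size, local_weight):
    """Pipeline: prefetch the next shard while computing on the current one."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        future = comm.submit(fetch_chunk, 0)  # prefetch the first shard
        for rank in range(world_size):
            shard = future.result()  # wait only for this shard
            if rank + 1 < world_size:
                future = comm.submit(fetch_chunk, rank + 1)  # overlap next comm
            out.append(matmul_chunk(shard, local_weight))    # compute now
    return out

if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # identity weight shard for the demo
    results = overlapped(world_size=4, local_weight=w)
    print(len(results))  # one partial result per gathered shard
```

In the real implementation this decomposition is done at the level of fused collective+matmul kernels; the sketch only shows why splitting the gather lets the comm latency hide behind compute.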
Thank you for the collaboration and the talk Mark. I think 🔥+PyTorch are a beautiful combination - a new flame for the torch! It was great to get to spend the day with you yesterday
At the Mojo hackathon today I went over how PyTorch is making it easier to spin up new backends for all the new languages and hardware we're seeing this year and a WIP backend we've been collaborating on with the Mojo team docs.google.com/presentation/d…
The biggest dataset of human-written GPU code, all open-source? 👀 YES Please! We at @GPU_MODE have released around 40k 🚀 human-written code samples spanning Triton, HIP and PyTorch, and it's all open on the @huggingface Hub. Train the new GPT to make GPTs faster ⚡️ Link below ⬇️
Make Flux go brrr on H100s without bells and whistles ⚡️ We're excited to share a simple recipe, dubbed `flux-fast`, that delivers a 2.5x speedup on H100 GPUs. 🔗 Blog: hubs.la/Q03tBKP70 ➡️ Code: hubs.la/Q03tBF8-0 By Joel Schlosser & @RisingSayak
ColorsWind, the team that won the AMD $100K grand prize, just OSS'd their code github.com/RadeonFlow/Rad…
anyone interested in building the best codegen LLM for GPU kernels with the @GPU_MODE team, completely open-source? lmk
We've run thousands of LLM inference serving benchmarks at @modal_labs. We're releasing the results so you don't have to. We're releasing the code so that you can. Introducing: The LLM Engineer's Almanac. Just in time for the @aiDotEngineer World's Fair.
Work w/ @cocosci_lab, @karthik_r_n, and @OfirPress Paper: arxiv.org/abs/2505.18134 Code: github.com/alexzhang13/vi… Website: vgbench.com Discord: discord.gg/W89VqYhQcy Our platform is completely open source and super easy to modify / plug into!
Will be live-coding the best performance tricks we saw from our kernel competition
Dive into GPU architecture & kernel dev with @MarkSaroufim at #AdvancingAI. Start with the basics, build a kernel, and join a live @GPU_MODE project to level up your skills. Register now → bit.ly/3SOgyf7
I've seen surprisingly few people complaining about 5090 PyTorch and Triton performance. Did y'all get past the scalpers? Should we make things go super fast?
At this point submitting your company reference kernels to the GPU MODE kernel leaderboard is the easiest way to accelerate them. All submissions will be made public at the end of every competition so please use the fast code and hire the cracked engineers who produced it
More cracked submissions to the @AMD x @GPU_MODE leaderboard! ✍️18k+ submissions since the beginning!! 🏆FP8 GEMM: A battle btwn Seb and Snektron for 🥇, with a 25% faster kernel since 2 weeks ago! 🤯 Single-device MoE: multiple ppl are now 100-600x faster than PyTorch ref!
🎉Factorio Learning Environment 0.2.0 released! 📖Details: jackhopkins.github.io/factorio-learn… New Features: - Multi-agent support - Reasoning models + MCP - Reflection & backtracking - Vision-augmented inputs and more frontier model results! The initial release of FLE was met with great…
Registration ends tomorrow for the grand prize, but you can still compete for glory until May 27. This is probably the hardest problem we've designed so far, so glhf!
📣 Problem 2, the fused Mixture-of-Experts kernel 🍿 for MI300s, is now OPEN for the @AMD x @GPU_MODE $100k competition! Go compete now for huge cash prizes -- registration ends SOON! Good luck everyone!