Tri Dao
@tri_dao
Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but it has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
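A quick back-of-the-envelope check on the utilization figure, assuming the commonly quoted ~989 dense FP16/BF16 tensor-core TFLOPS of an H100 SXM; the FLOP-counting helper below is an illustrative sketch, not FlashAttention-3's code:

```python
# Sanity check: 740 TFLOPS against the ~989 TFLOPS dense FP16/BF16
# tensor-core peak commonly quoted for H100 SXM gives ~75% utilization.
peak_fp16_tflops = 989
achieved_tflops = 740
print(f"utilization ≈ {achieved_tflops / peak_fp16_tflops:.0%}")  # ≈ 75%

# Illustrative FLOP count for one attention forward pass
# (two matmuls of 2*S*S*D FLOPs per head), handy for converting a
# measured runtime into TFLOPS.
def attn_flops(batch, heads, seqlen, head_dim, causal=False):
    flops = 4 * batch * heads * seqlen * seqlen * head_dim
    return flops // 2 if causal else flops
```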

⏱️AI is making the verification process easier, with models verifying proofs in minutes. 💻 Now, @prfsanjeevarora, @chijinML, @danqi_chen and @PrincetonPLI have released Goedel Prover V2, a model that is more efficient and more accurate than any before it. 👉 blog.goedel-prover.com
🧠 Qwen3 just leveled up on Together AI 🚀 Qwen3-235B-A22B-Instruct-2507-FP8 isn't just another model update - it's a leap forward 📈
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
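For intuition on the "480B total / 35B active" numbers: a Mixture-of-Experts model only pays matmul compute for the experts each token is routed to. A rough sketch under the standard ~2 FLOPs per active parameter per token approximation (not Qwen's published accounting):

```python
# Forward-pass compute per token scales with *active* parameters, not total.
def flops_per_token(active_params):
    return 2 * active_params  # standard matmul-dominated approximation

moe = flops_per_token(35e9)      # Qwen3-Coder-480B-A35B: ~7e10 FLOPs/token
dense = flops_per_token(480e9)   # hypothetical dense 480B: ~9.6e11 FLOPs/token
print(f"~{dense / moe:.0f}x less compute per token than a dense 480B model")
```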
Hierarchical layout is super elegant. Feels like the right abstraction for high-performance GPU kernels. FlashAttention-2 actually started because we wanted to rewrite FA1 in CuTe
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!
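The core idea behind CuTe's hierarchical layouts: a layout is a nested (shape, stride) pair, and a coordinate maps to a linear offset by an inner product with the strides, applied through the nesting. A toy Python sketch of that mapping, not CuTe's actual C++ or DSL API:

```python
# Toy illustration of a hierarchical layout: the linear index of a
# (possibly nested) coordinate is the inner product with the strides,
# applied recursively through the nesting.
def index_of(coord, stride):
    if isinstance(coord, tuple):
        return sum(index_of(c, s) for c, s in zip(coord, stride))
    return coord * stride

# Layout with shape (2, (2, 2)) and stride (4, (1, 2)):
# the outer mode has stride 4, the inner pair has strides (1, 2).
stride = (4, (1, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            print((i, (j, k)), "->", index_of((i, (j, k)), stride))
```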
@tri_dao kicking off our workshop #CODEML on open-source in machine learning @icmlconf with a retrospective on open-source implementations of attention.
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
MOSS is happening this Saturday (7/19) at West Ballroom B, Vancouver Center! We are excited to have an amazing set of talks, posters, and panel discussions on the insights from and potential of small-scale analyses. Hope to see a lot of you there! 💡
On Saturday we’re hosting the ES-FoMo workshop, with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others - we’ve got a great slate of papers and invited talks, come join us! (More on the great slate of speakers soon) x.com/esfomo/status/… 2/
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
🎉 Congratulations to Together AI for raising the bar with record-fast inference on the DeepSeek-R1-0528 model, accelerated by our #NVIDIABlackwell platform—built for next-level compute, memory, and bandwidth to uplift the entire AI ecosystem. #AcceleratedComputing Learn more…
Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528 We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs—and the results speak for themselves: 📈 Highest known serverless throughput: 334 tokens/sec 🏃Fastest time to first answer token:…
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
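For readers unfamiliar with the Pass@32 numbers above: pass@k is typically computed with the unbiased estimator from the Codex paper, sampling n proof attempts per problem and counting the c that the checker verifies. A sketch of that standard formula; Goedel-Prover's exact evaluation protocol may differ:

```python
from math import comb

# Unbiased pass@k estimator: probability that at least one of k draws
# (without replacement) from n sampled attempts, c of which are correct,
# succeeds.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 8 verified proofs out of 64 samples, reported at k = 32
print(f"pass@32 ≈ {pass_at_k(64, 8, 32):.3f}")
```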
🚨MAJOR DROP: Kimi K2 just landed on Together AI 🚀 An open-source 1T parameter model that beats proprietary LLMs in creativity, coding, and tool use while delivering 60-70% cost savings. Built for agents. Priced for scale. 👇
Congratulations to @parastooabtahi @tri_dao and Alex on this honor. Chats with people like this in the coffee room is a special pleasure at work!
Congrats to @parastooabtahi, @tri_dao and Alex Lombardi on being named 2025 Google Research Scholars. 🎉 The @googleresearch scholars program funds world-class research conducted by early-career professors. bit.ly/4kvpvFx
I played with it for an hour. Went through my usual prompts (math derivations, floating-point optimizations, …). It’s a good model; it feels comparable to the best frontier models
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
They’ve finally done it. They got rid of tokenizers!
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
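A toy illustration of the dynamic-chunking idea, not the actual H-Net architecture: a learned scorer proposes boundaries over raw bytes, and the model pools byte embeddings into variable-length chunks, so no external tokenizer is needed. All module names and the 0.5 threshold below are hypothetical stand-ins:

```python
import torch

emb = torch.nn.Embedding(256, 64)   # byte embeddings (illustrative sizes)
scorer = torch.nn.Linear(64, 1)     # boundary scorer (hypothetical stand-in)

byte_ids = torch.tensor(list("hello world".encode()))
x = emb(byte_ids)                                        # (seq, 64)
boundary = torch.sigmoid(scorer(x)).squeeze(-1) > 0.5    # proposed chunk ends

chunks, start = [], 0
for i, is_end in enumerate(boundary.tolist()):
    if is_end or i == len(byte_ids) - 1:
        chunks.append(x[start:i + 1].mean(dim=0))        # pool one chunk
        start = i + 1
chunks = torch.stack(chunks)        # (num_chunks, 64), fed to the backbone
```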
CuTe DSL feels almost unreal: minimal Python code hits peak memory throughput on H100, as we show in QuACK. Can't wait for the addition of kernels optimized for Blackwell in QuACK 🦆
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
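"SOL" here means speed-of-light: a memory-bound kernel is limited by HBM bandwidth, so the figure of merit is achieved bytes per second against the hardware peak rather than FLOPS. A back-of-the-envelope sketch with made-up numbers, not QuACK's benchmark code:

```python
# Achieved bandwidth = total bytes moved / kernel time.
def achieved_bandwidth_tb_s(bytes_read, bytes_written, seconds):
    return (bytes_read + bytes_written) / seconds / 1e12

n_bytes = 2 * 1024**3          # e.g. a 2 GiB tensor read and written once
t = 1.5e-3                     # hypothetical kernel time: 1.5 ms
bw = achieved_bandwidth_tb_s(n_bytes, n_bytes, t)
peak = 3.35                    # H100 SXM HBM3 peak, roughly 3.35 TB/s
print(f"{bw:.2f} TB/s achieved, {bw / peak:.0%} of peak")
```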
🦆QuACK: blazing fast cute-DSL GPU kernels with 3TB/s goodness! Optimizing your kernels as much as possible is important... unless you are okay with leaving throughput on the table. check out this work from @wentao, @tedzadouri and @tri_dao