Vijay
@__tensorcore__
MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python docs.nvidia.com/cutlass/media/…
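For a taste of the DSL: a minimal sketch modeled on the hello-world example from the CUTLASS 4.0 docs (module paths and the launch API here are from memory and may differ slightly between 4.x releases):

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def kernel():
    # Device-side code, traced and compiled by the DSL.
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("Hello world from the GPU")

@cute.jit
def hello_world():
    # Host-side entry point: set up CUDA, then launch one 32-thread block.
    cutlass.cuda.initialize_cuda_context()
    kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))

hello_world()
```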

arXiv GPU kernel researchers be like: • liquid-nitrogen cooling their benchmark GPU • overclocking their H200 to 1000W ("Custom Thermal Solution", CTS) • nvidia-smi boost-slider --vboost 1 • nvidia-smi -i 0 --lock-gpu-clocks=1830,1830 • using specially binned GPUs where the number…
Part 2: developer.nvidia.com/blog/cutlass-3… Covers the design of CUTLASS 3.x itself and how it builds a 2 layer GPU microkernel abstraction using CuTe as the foundation.
CUTLASS 4.1 is now available; it adds support for ARM systems (GB200) and block-scaled MMAs
Hierarchical layout is super elegant. Feels like the right abstraction for high performance GPU kernels. FlashAttention 2 actually started bc we wanted to rewrite FA1 in CuTe
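For anyone who hasn't met CuTe's layouts: a layout is a (shape, stride) pair, and "hierarchical" means the modes themselves can be nested tuples. A toy sketch of the algebra in plain Python (this mimics the idea, not the CuTe API):

```python
# Toy model of CuTe's layout algebra: a layout maps a (possibly nested)
# coordinate to a linear offset via a dot product with strides, recursing
# into nested modes.

def offset(coord, shape, stride):
    """Map a nested coordinate to a linear offset under (shape, stride)."""
    if isinstance(shape, tuple):
        return sum(offset(c, s, d) for c, s, d in zip(coord, shape, stride))
    assert 0 <= coord < shape
    return coord * stride

# A 4x8 row-major layout: shape (4, 8), stride (8, 1).
assert offset((2, 3), (4, 8), (8, 1)) == 19

# A hierarchical layout: 2x2 tiles of 2x2 elements, with strides chosen
# so that each tile is laid out before moving to the next one.
print(offset(((1, 0), (1, 1)), ((2, 2), (2, 2)), ((2, 8), (1, 4))))  # 7
```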
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!
CuTe is such an elegant library that we stopped working on our own system and wholeheartedly adopted CUTLASS for vLLM at the beginning of 2024. I can happily report that was a very wise investment! Vijay and co should be so proud of the many strong OSS projects built on top 🥳
This is what the internet was made for 🥹
presenting: big jeff's trainium hell
Cosmos-Predict2 meets NATTEN. We just released variants of Cosmos-Predict2 where we replace most self-attention layers with neighborhood attention, bringing up to 2.6X end-to-end speedup with minimal effect on quality! github.com/nvidia-cosmos/… (1/5)
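The idea in one toy snippet: each query attends only to keys within a fixed window around it, so attention cost scales with window size rather than sequence length. A 1-D sketch with a materialized mask (illustrative only; the NATTEN kernels avoid building the full score matrix):

```python
import torch

def neighborhood_attention_1d(q, k, v, window: int):
    """Toy 1-D neighborhood attention: query i attends to keys j with |i-j| <= window.
    q, k, v: (seq, dim)."""
    seq, dim = q.shape
    scores = (q @ k.T) / dim**0.5                        # (seq, seq)
    idx = torch.arange(seq)
    mask = (idx[:, None] - idx[None, :]).abs() > window  # True = outside neighborhood
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(128, 64)
out = neighborhood_attention_1d(q, k, v, window=8)
print(out.shape)  # torch.Size([128, 64])
```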
Getting mem-bound kernels to speed-of-light isn't a dark art; it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can directly use. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
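Speed-of-light for a mem-bound kernel just means running at peak DRAM bandwidth, so the benchmark is simple bookkeeping: bytes moved over elapsed time versus the spec sheet. A rough sketch (3.35 TB/s is the H100 SXM spec; adjust for your part):

```python
import time
import torch

def achieved_bandwidth(fn, bytes_moved, iters=100):
    """Time a CUDA op and report achieved DRAM bandwidth in GB/s."""
    fn()                          # warmup (also triggers any compilation)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return bytes_moved / dt / 1e9

n = 1 << 28
x = torch.randn(n, device="cuda", dtype=torch.float16)
y = torch.randn(n, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
# Elementwise add moves 3 arrays: read x, read y, write out.
gbps = achieved_bandwidth(lambda: torch.add(x, y, out=out), 3 * n * x.element_size())
print(f"{gbps:.0f} GB/s, {gbps / 3350 * 100:.0f}% of H100 SXM speed-of-light")
```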
Another 🔥 blog about CUTLASS from @colfaxintl, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…
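If block scaling is new to you: each small block of elements shares one scale factor stored next to the low-precision values. A toy roundtrip in plain Python (block size and scale encoding here are simplified; the post covers the actual MXFP/NVFP formats, e.g. power-of-two E8M0 scales):

```python
import numpy as np

def block_quantize(x, block=32, max_code=6.0):
    """Toy block-scaled quantization: one shared scale per `block` values.
    max_code=6.0 matches the largest FP4 (E2M1) magnitude; real MXFP also
    restricts scales to powers of two."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / max_code
    scales[scales == 0] = 1.0
    codes = np.round(x / scales)          # stand-in for FP4/FP6/FP8 rounding
    return codes, scales

def block_dequantize(codes, scales):
    return (codes * scales).reshape(-1)

x = np.random.randn(128).astype(np.float32)
codes, scales = block_quantize(x)
err = np.abs(block_dequantize(codes, scales) - x).max()
print(f"max abs roundtrip error: {err:.3f}")
```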
We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference, with high arithmetic intensity (make GPUs go brr even during decoding), easy to…
"Pre-training was hard, inference easy; now everything is hard."-Jensen Huang. Inference drives AI progress b/c of test-time compute. Introducing inference aware attn: parallel-friendly, high arithmetic intensity – Grouped-Tied Attn & Grouped Latent Attn
Introducing soarXiv ✈️, the most beautiful way to explore human knowledge. Take any paper's URL and replace arxiv with soarxiv (shown in the video) to teleport to its place in the universe. I've embedded all 2.8M papers up until April 2025. Try it at: soarxiv dot org
timelapse #58 (14.5 hrs): - used the CUTLASS Python DSL to increase elementwise add/mul memory throughput (from 500GB/s with PyTorch to 850GB/s with CUTLASS) - diving into CUTLASS 4.0 (minus tile abstractions) - CUDA book design decisions with @mrsiipa - restructure of 5 chapters -…
I love CUTLASS, and this new Python DSL looks very well designed. Will for sure accelerate kernel dev + exploring new ideas in ML + GPU. I'm already playing with it and having fun