Horace He
@cHHillee
@thinkymachines Formerly @PyTorch "My learning style is Horace twitter threads" - @typedfemale
For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. pytorch.org/blog/flexatten… 1/10
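For context, the whole API surface is roughly this small (a minimal sketch based on the blog post's causal example; shapes are illustrative, and in practice you'd wrap flex_attention in torch.compile to get the fused kernel):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda") for _ in range(3))

# score_mod is called on each attention score; returning -inf masks it out.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

out = flex_attention(q, k, v, score_mod=causal)
```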

When this word started popping up, I initially smugly thought that people were misspelling "syncophant", only to realize that I'd entangled "sycophant" with "syncopation" in my head.
people using sycophant like they knew what it was
Most normal FlexAttention mask. Also, thanks for the "Implementation-wise, although FlexAttention practically enabled the project..." comment - that's perhaps the #1 thing we were hoping for with FlexAttention :)
Depending on how dense you want the neighbourhood for your local attention to be, the attention matrix is very sparse, and FlexAttention can exploit that sparsity, since the functional form of the mask turns out to be pretty simple. 10/N 🧵
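Concretely, such a local-attention mask is a one-line mask_mod (a sketch assuming a neighbourhood radius of 256; the simple functional form is exactly what lets the kernel skip fully-masked blocks):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

WINDOW = 256  # assumed neighbourhood radius for local attention

# Attend only to positions within WINDOW of the query; this simple functional
# form compiles to block sparsity rather than a dense seq_len^2 mask.
def local_mask(b, h, q_idx, kv_idx):
    return (q_idx - kv_idx).abs() <= WINDOW

q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda") for _ in range(3))
block_mask = create_block_mask(local_mask, B=None, H=None, Q_LEN=4096, KV_LEN=4096)
out = flex_attention(q, k, v, block_mask=block_mask)  # fully-masked blocks are skipped
```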
Other than OpenAI, how many AI efforts do you think will have gotten a gold medal at the IMO? Several other AI labs are vagueposting about their IMO results, but seem to be abiding by the IMO's request for a week's delay.
It's been an exciting 3 months at Thinky and so much has happened already! Imo we're building some of the best research infra around. Research infra is about jointly optimizing researcher *and* GPU efficiency, and it's been a joy to work on this with the other great folk here!
Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're…
I'll be at MLSys today! DM me if you want to chat about PyTorch, ML systems, or life at Thinking Machines!
The fundamental question here (computing MFU) is a very reasonable question to ask in an interview (and I'd recommend learning how to do it if you don't know). However, the real interview question I would like to ask is this: "I see 3 assumptions in this question that range from…
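For the curious, the standard back-of-envelope MFU calculation looks like this (a sketch using the common 6 * params FLOPs-per-token approximation; every hardware and throughput number below is assumed for illustration):

```python
# MFU = achieved model FLOPs/s divided by peak hardware FLOPs/s.
# Uses the common 6 * n_params FLOPs/token estimate (fwd + bwd, ignoring
# attention FLOPs). Every number here is illustrative, not measured.
n_params = 7e9            # hypothetical 7B-parameter model
tokens_per_sec = 50_000   # hypothetical measured training throughput
peak_flops = 989e12       # e.g. one H100's BF16 dense peak
n_gpus = 8

mfu = (6 * n_params * tokens_per_sec) / (peak_flops * n_gpus)
print(f"MFU: {mfu:.1%}")  # ~26.5% with these numbers
```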
Saw this on Reddit with half the comments shitting on it
This is pretty neat. They hook into torch.compile and insert profile-guided optimizations, as well as a bunch of other specific optimizations like offloading. Since torch.compile is all in Python, all their compiler passes are fairly accessible too (toy sketch below)! github.com/deepspeedai/De…
Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations. - Automatic parallelization & profile-guided optimizations - Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes - 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading tinyurl.com/8cys28xk
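The "accessible in Python" point is easy to see with torch.compile's custom-backend hook (a toy sketch of the generic hook, not DeepCompile's actual pass machinery):

```python
import torch

# A toy torch.compile backend: the captured FX graph arrives as a plain
# Python object, so an analysis or rewrite pass is just Python code here.
def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # a real pass would rewrite the graph instead
    return gm.forward         # run the (unmodified) graph eagerly

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(4))
```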
I'll be here talking about ML systems! There'll be some of the best GPU folk I know here, so come learn more about Blackwell GPUs together!
SemiAnalysis is hosting an Nvidia Blackwell GPU Hackathon on Sunday March 16th. It is the ultimate playground for Blackwell PTX tech enthusiasts, offering hands-on exploration of Blackwell & PTX infrastructure while collaborating on open-source projects.
It is hard to overstate how cool and powerful flex attention is. @cHHillee pytorch.org/blog/flexatten… TL;DR: it is an implementation of the attention operator in @pytorch that, in particular, lets you efficiently "carve" the attention matrix. 1/3