soham
@SohamGovande
@OpenAI 🌲 stanford cs | prev @nvidia @hazyresearch
introducing chipmunk: a training-free algorithm making ai video generation 3.7x & image gen 1.6x faster! ⚡️ our kernels for column-sparse attention are 9.3x faster than FlashAttention-3, and column-sparse GEMM is 2.5x faster than cuBLAS. a thread on the GPU kernel optimizations 🧵
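the real kernels are CUDA, but here's a minimal NumPy sketch of the column-sparse idea (not the actual Chipmunk implementation): instead of attending over all key/value columns, keep only the most important fraction of columns and run softmax attention over that subset. the scoring heuristic (mean attention score per column) and `keep_frac` are illustrative assumptions.

```python
import numpy as np

def column_sparse_attention(q, k, v, keep_frac=0.25):
    """Sketch of column-sparse attention: attend over only the
    top fraction of key/value columns, ranked by a cheap importance score.
    Illustrative only -- the real kernels do this inside a fused CUDA kernel."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_k) attention logits
    col_importance = scores.mean(axis=0)           # mean logit per key column
    n_keep = max(1, int(keep_frac * k.shape[0]))
    cols = np.argsort(col_importance)[-n_keep:]    # indices of kept columns
    s = scores[:, cols]                            # restrict to kept columns
    p = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v[cols]                             # weighted sum over kept values
```

with `keep_frac=1.0` this reduces exactly to dense attention; the speedup comes from the GEMMs only touching `n_keep` columns instead of all of them.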
Our latest joint work w/ SandyResearch @ UCSD: training-free acceleration of Diffusion Transformers w/ dynamic sparsity, led by @austinsilveria @SohamGovande! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!
After 2 years at @nvidia, I’m writing to share that I’ll start a new adventure. Working with brilliant teammates on cutting‑edge AI has shaped me so much: - Cosmos debuted as a SOTA world model and earned 8k ⭐️ on GitHub. - We open‑sourced the first recipe for upcycling 100B+…
Two papers at the workshop I’m a bit fond of… @austinsilveria and @SohamGovande are going to be presenting Chipmunk - come chat with them about how they made video diffusion 3.7x faster! (With custom column-sparse attention kernels) 3/
Training-free acceleration of Diffusion Transformers with dynamic sparsity and cross-step attention/MLP deltas, a collaboration with @SohamGovande and @realDanFu! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!
I've gotten great results just from asking "how would someone much better than me approach this?" It's helped me learn fast as a beginner, and at times go from good to world-class. Mimicry & simulation are superpowers we overlook in favor of learning "the right way." More below.
"how would someone much much better than me approach this?" also annoyingly OP
Thrilled to share that I’ve joined @reflection_ai! We’re building superintelligent autonomous systems by co-designing research and product. Today, we’re launching Asimov. As AI benchmarks saturate, evaluation will increasingly live inside real-world products that are…
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai: the best-in-class code research agent, built for teams and organizations.
🐿️ chipmunk ship! flux kontext supported for up to 30% faster cute chipmunks!
Some updates to Chipmunk! 🐿️ Chipmunk now supports Wan 2.1, with up to 2.67x speedup - completely training-free! The paper is up on arXiv - take a look to see more in-depth analysis of sparsity in video models. Only 5-25% of activations account for >90% of the output!
chipmunk is up on arxiv! across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps. caching + sparsity speeds up generation by only recomputing fast-changing activations
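a toy sketch of the caching idea (assumed shape, not the paper's implementation): keep the previous step's activations, and overwrite only the fraction that changed fastest. here the full new activations are computed just to rank the deltas; the real method avoids that by computing only a sparse subset.

```python
import numpy as np

def cached_sparse_update(prev_act, new_act_fn, recompute_frac=0.1):
    """Sketch of cross-step activation caching: reuse last step's
    activations and refresh only the fastest-changing entries.
    Toy version -- it computes the full new activations to find them,
    which a real sparse kernel would avoid."""
    full = new_act_fn()                        # dense recompute, for illustration
    delta = np.abs(full - prev_act)            # how much each activation moved
    n = max(1, int(recompute_frac * delta.size))
    idx = np.argsort(delta.ravel())[-n:]       # fastest-changing entries
    out = prev_act.ravel().copy()
    out[idx] = full.ravel()[idx]               # refresh only those entries
    return out.reshape(prev_act.shape)
```

the observation in the tweet is what makes this work: since a small slice of activations carries most of the cross-step change, refreshing only that slice preserves quality while skipping most of the compute.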
super fun to work on this :)
Abundance campus organizing mentioned in the WSJ :) I care about this movement because my generation grew up in an era of dysfunctional, antagonistic, and reactive politics. Abundance focuses on outcomes, progress, and a vision for better days ahead. More coming soon!
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint…
terrific work @Avanika15 & team! hybrid local and cloud LLM interactions are the future
can you chat privately with a cloud llm—*without* sacrificing speed? excited to release minions secure chat: an open-source protocol for end-to-end encrypted llm chat with <1% latency overhead (even @ 30B+ params!). cloud providers can’t peek—messages decrypt only inside a…