Anne Ouyang
@anneouyang
CS PhD student @Stanford | prev: cuDNN @Nvidia, M.Eng, B.S. in CS @MIT | efficient scalable self-improving AI systems | 🌽KernelBench
✨ New blog post 👀: We have some very fast AI-generated kernels, produced with a simple test-time-only search. They perform close to, and in some cases even beat, the standard expert-optimized production kernels shipped in PyTorch. (1/6) [🔗 link in final post]
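The thread doesn't show code, but a test-time-only search over kernels is conceptually simple: sample many candidate implementations, discard any that don't match a reference on test inputs, and keep the fastest survivor. Here is a minimal sketch of that loop, assuming a CUDA device and candidate kernels already materialized as Python callables (in practice the candidate list would come from an LLM; this is not the blog post's actual pipeline):

```python
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    """Average per-call latency in milliseconds, using CUDA-event timing."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def search(reference_fn, candidates, inputs, atol=1e-4):
    """Test-time search: keep only correct candidates, return the fastest."""
    expected = reference_fn(*inputs)
    best_fn, best_ms = reference_fn, bench_ms(reference_fn, *inputs)
    for fn in candidates:
        try:
            out = fn(*inputs)
        except Exception:
            continue  # candidate crashed: discard
        if not torch.allclose(out, expected, atol=atol):
            continue  # candidate is wrong: discard
        ms = bench_ms(fn, *inputs)
        if ms < best_ms:
            best_fn, best_ms = fn, ms
    return best_fn, best_ms
```

Correctness filtering before timing is the important part: a fast-but-wrong kernel is worthless, so candidates only get benchmarked after matching the reference output.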
Looking forward to attending ICML! Here are some works on memory/long context, verification, kernel design, multi-model AI systems, and theoretical understanding of test-time scaling from my awesome students and collaborators!
tech is full of people quietly haunted by the artist they could've been ✨

We introduce CodeARC, a new benchmark for evaluating LLMs’ inductive reasoning. Agents must synthesize functions from I/O examples—no natural language, just reasoning. 📄 arxiv.org/pdf/2503.23145 💻 github.com/Anjiang-Wei/Co… 🌐 anjiang-wei.github.io/CodeARC-Websit… #LLM #Reasoning #LLM4Code #ARC
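For context on the task format: a CodeARC-style problem gives the agent only input/output pairs and asks for a function consistent with them. A toy mock-up of that setup (illustrative only; not CodeARC's actual data or harness):

```python
# Toy inductive-synthesis task: the agent sees only these I/O pairs.
io_examples = [((2, 3), 6), ((4, 5), 20), ((0, 7), 0)]  # hidden rule: a * b

def check(candidate, examples):
    """A proposed function passes only if it reproduces every observed output."""
    return all(candidate(*inputs) == output for inputs, output in examples)

# One candidate the agent might propose, with no natural-language spec to lean on:
def candidate(a, b):
    return a * b

assert check(candidate, io_examples)
```

The real benchmark is interactive, letting agents query additional inputs to disambiguate between the many functions that fit a finite example set; this toy omits that.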
This is a proper Vibe-coding setup for GPU programmers, and it can get you surprisingly far! I honestly think that if this authoring experience is v1, then v10 might become the normal way GPU experts start writing serious custom kernels! Great work @anneouyang! (finally…
Exciting! Looking forward to it
KernelBench by @simonguozirui and @anneouyang is about to land in prime-rl 🛬 Our next reasoning model will be much better at writing kernels!
Thanks for the repro! I also attached the result of running this layer norm kernel on an Nvidia 5090 (1311% perf of baseline) Kernels are very hardware (and problem size) dependent, but that’s great news for auto kernel optimization. AI can easily run architecture and workload…
Did a mini replication on Colab of the LayerNorm kernel (because 484.4% seemed hard to believe) and it ~replicates (T4 vs L40 etc). Super impressive work! Even kernel engineers aren't safe.
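The percentages in this exchange (484.4% on one setup, 1311% on a 5090) read as baseline time divided by custom-kernel time, and as both tweets note, they move with the GPU and the problem size. A sketch of how such a comparison can be run against PyTorch's eager layer norm (`custom_fn` is a placeholder for whichever generated kernel is being tested; this is not the exact Colab setup):

```python
import torch
import torch.nn.functional as F

def percent_of_baseline(custom_fn, rows=4096, cols=8192, iters=200):
    """Custom-kernel speed as a percentage of the PyTorch eager baseline."""
    x = torch.randn(rows, cols, device="cuda")
    weight = torch.ones(cols, device="cuda")
    bias = torch.zeros(cols, device="cuda")

    def time_ms(fn):
        for _ in range(20):  # warmup, so timing excludes one-time setup costs
            fn()
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    baseline = time_ms(lambda: F.layer_norm(x, (cols,), weight, bias))
    custom = time_ms(lambda: custom_fn(x, weight, bias))
    return 100.0 * baseline / custom  # 484.4% would mean ~4.8x faster
```

Re-running the same script on a T4, an L40, or a 5090 gives different numbers, which is exactly the hardware dependence the replies point out.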
cool work by @CaiaCostello! interesting takeaway about model collapse from the angle of confidence vs. diversity in math and coding tasks
1/5 Can small models learn to reason without RL or large datasets? Success of LLM post-training with synthetic data hinges on:
1. Generating Model Size
2. Synthetic Data Volume
3. Pruning Strategy
4. Number of Fine-Tuning Rounds
We found a simple recipe: Think, Prune, Train (TPT)
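Read as an algorithm, the TPT recipe is a self-improvement loop: the model generates its own training data, a correctness filter prunes it, and the model is fine-tuned on what survives, repeated for several rounds. A schematic of that loop, with the model-specific pieces passed in as callables since the paper's actual generation, checking, and fine-tuning code isn't in the thread:

```python
from typing import Callable, List, Tuple

def think_prune_train(
    model,
    problems: List[str],
    generate: Callable,      # (model, problem, n) -> list of candidate solutions
    passes_check: Callable,  # (problem, solution) -> bool, e.g. tests or answer match
    finetune: Callable,      # (model, examples) -> updated model
    rounds: int = 3,
    samples_per_problem: int = 8,
):
    """Schematic Think-Prune-Train loop: self-generate, filter, fine-tune, repeat."""
    for _ in range(rounds):
        # Think: sample candidate solutions from the current model.
        candidates: List[Tuple[str, str]] = [
            (p, sol)
            for p in problems
            for sol in generate(model, p, n=samples_per_problem)
        ]
        # Prune: keep only solutions that pass the correctness check.
        kept = [(p, s) for p, s in candidates if passes_check(p, s)]
        # Train: fine-tune on the surviving synthetic data.
        model = finetune(model, kept)
    return model
```

The four knobs in the tweet map directly onto this loop: the generating model's size, how much synthetic data `generate` produces per problem, how aggressively `passes_check` prunes, and how many rounds to run.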
KernelBench as a whole is great! It's a big step forward for automatic kernel generation. Looking forward to the next version
Fresh on the arXiv and powered by Modal: work from @anneouyang, @simonguozirui, and others on writing faster kernels for model inference using large language models 🪆
arXiv's out!
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇
congrats on the launch!!
Write a fast kernel and run it on Discord. See how you compare against the best! If you're familiar with LeetCode, Kaggle, or Codeforces, then this should feel right at home
another interesting work on LLM for kernel gen ft. KernelBench!
Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. sakana.ai/ai-cuda-engine… The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in…