Simon Guo
@simonguozirui
CS PhD student @Stanford | 🎓 @Berkeley_EECS | prev pre-training @cohere & built things at @anyscalecompute @nvidia
LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇
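Roughly, the evaluation comes down to a correctness check against the eager reference plus a wall-clock comparison. A simplified sketch, not the actual KernelBench harness (the models, inputs, warmup counts, and tolerances here are stand-ins):

```python
import time
import torch

def speedup_over_eager(ref_model, candidate_model, inputs, warmup=3, iters=10):
    """Sketch of a correctness + speedup check for a generated kernel (assumes CUDA)."""
    # Correctness: candidate must match the PyTorch eager reference output.
    with torch.no_grad():
        ref_out = ref_model(*inputs)
        cand_out = candidate_model(*inputs)
    assert torch.allclose(ref_out, cand_out, rtol=1e-2, atol=1e-2), "output mismatch"

    def bench(model):
        for _ in range(warmup):
            model(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(*inputs)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    return bench(ref_model) / bench(candidate_model)  # >1.0 means faster than eager
```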

When the eager mode fallback swoops in quickly after torch.compile hits a part of the graph it can't compile
Luigi Death Stare (2014)
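For the non-meme version: a single graph break is enough to trigger that fallback. A minimal illustration (the `.item()` call is just one common way to force a break under default dynamo settings):

```python
import torch

@torch.compile
def f(x):
    # .item() on a tensor forces a graph break by default:
    # torch.compile runs the traced portion and falls back to eager for the rest.
    if x.sum().item() > 0:
        return x * 2
    return x - 1

print(f(torch.randn(4)))
```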
~4/8~ For the forward pass, we developed a specialized two-kernel implementation. The first fuses gather, projections, and SwiGLU, while the second uses an efficient scatter into BF16 + reduce for the sum.
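Written out as an unfused PyTorch reference, the computation those two kernels cover looks roughly like this (a sketch of the math only, not the fused kernels themselves; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def moe_forward_reference(x, topk_ids, topk_w, w1, w3, w2):
    """Unfused reference of a top-k MoE forward pass.

    x:        [T, d]  tokens
    topk_ids: [T, k]  expert index per (token, slot)
    topk_w:   [T, k]  router weight per (token, slot)
    w1, w3:   [E, d, h]  expert up-projections (SwiGLU)
    w2:       [E, h, d]  expert down-projection
    """
    T, k = topk_ids.shape
    out = torch.zeros_like(x, dtype=torch.bfloat16)
    flat_ids = topk_ids.reshape(-1)                      # [T*k]
    flat_w = topk_w.reshape(-1, 1)                       # [T*k, 1]
    token_idx = torch.arange(T, device=x.device).repeat_interleave(k)
    for e in range(w1.shape[0]):
        sel = (flat_ids == e).nonzero(as_tuple=True)[0]
        if sel.numel() == 0:
            continue
        rows = token_idx[sel]
        xe = x[rows]                                     # gather tokens routed to expert e
        h = F.silu(xe @ w1[e]) * (xe @ w3[e])            # "kernel 1": gather + projections + SwiGLU
        ye = (h @ w2[e]) * flat_w[sel]                   # down-projection, scaled by router weight
        out.index_add_(0, rows, ye.to(torch.bfloat16))   # "kernel 2": scatter into BF16 + reduce (sum)
    return out
```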
Mixture‑of‑Experts (MoE) powers many frontier models like R1, K2, & Qwen3 ⚡️ To make frontier-scale MoE models accessible to train, we open-source MoMoE, a hyper-performant MoE implementation built for training and inference, outpacing the fastest existing ones by up to: - 70%…
Announcing The Toronto School Of Foundation Modelling, a Toronto-exclusive, in-person-only school for learning to build Foundation Models. Coming to New Stadium and Youthful Vengeance in late August 2025.
On Sep 6 in NYC, this won't be your typical hackathon where you do your own thing in a corner and then present at the end of the day. You'll deploy real models to the market, trades will happen, chaos should be expected. The fastest model is great but time to market matters more.
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
I’m building a new team at @GoogleDeepMind to work on Open-Ended Discovery! We’re looking for strong Research Scientists and Research Engineers to help us push the frontier of autonomously discovering novel artifacts such as new knowledge, capabilities, or algorithms, in an…
Our new state-of-the-art AI model Aeneas transforms how historians connect the past. 📜 Ancient inscriptions often lack context – it's like solving a puzzle with 90% of the pieces lost to time. Aeneas helps researchers interpret and situate inscriptions in their historical context. 🧵
We tested Aeneas on the Res Gestae Divi Augusti – one of the most debated inscriptions. Without prior knowledge, it successfully mapped out the leading scholarly theories on its dating, showing how AI can help model history in a quantitative way. 📊
🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 huggingface.co/spaces/nvidia/…
Check out Tokasaurus on Modal to make Llama-1B brrr! This repeated sampling example shows off two engine features that are important for serving small models: very low CPU overhead and automatic shared prefix exploitation with Hydragen.
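The Hydragen idea in one hedged sketch: attend to the shared prompt once, attend to each sample's own suffix separately, and merge the two partial softmaxes via their normalizers (shapes and names are illustrative, not the Tokasaurus API; causal masking is omitted):

```python
import math
import torch

def partial_attn(q, k, v):
    # Attention over one KV chunk, returning the output plus its log-sum-exp normalizer.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # [B, Tq, Tk]
    lse = torch.logsumexp(scores, dim=-1)                        # [B, Tq]
    out = torch.exp(scores - lse.unsqueeze(-1)) @ v              # softmax(scores) @ v
    return out, lse

def shared_prefix_attn(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """q: [B, Tq, d] queries for B samples that all share one prompt.
    k_prefix/v_prefix: [1, Tp, d] shared-prompt KV stored once.
    k_suffix/v_suffix: [B, Ts, d] per-sample KV."""
    B = q.shape[0]
    o_p, lse_p = partial_attn(q, k_prefix.expand(B, -1, -1), v_prefix.expand(B, -1, -1))
    o_s, lse_s = partial_attn(q, k_suffix, v_suffix)
    # Merge the two partial softmaxes using their normalizers.
    m = torch.maximum(lse_p, lse_s)
    w_p, w_s = torch.exp(lse_p - m), torch.exp(lse_s - m)
    return (o_p * w_p.unsqueeze(-1) + o_s * w_s.unsqueeze(-1)) / (w_p + w_s).unsqueeze(-1)
```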
Tokasaurus, the "little LLM engine that could" by @jordanjuravsky and @EyubogluSabri of @HazyResearch/@ScalingIntelLab, is capable of some pretty impressive perf. We replicated their report of >80k tok/s for 16bit LLaMA 3.1 8B on Large Language Monkeys GSM8K - and you can too!
officially out! imo this is super exciting for two reasons: 1) text in, text out, with no Lean decoding or tool use, and 2) reasoning capabilities that scale with compute
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
Infinite Wiki ⁕ Every word is a hyperlink. Every description is generated in real time (in ~1 second). ⁕ Runs on Gemini 2.5 Flash Lite; ASCII diagrams use 2.5 Flash.
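The core loop is one fast model call per clicked word. A rough sketch with the google-genai SDK (the prompt and wiring are guesses, not the actual app):

```python
from google import genai

client = genai.Client()  # assumes an API key is set in the environment

def describe(word: str) -> str:
    # One low-latency call per clicked word.
    resp = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=f"Write a two-sentence encyclopedia entry for '{word}'.",
    )
    return resp.text

print(describe("hyperlink"))
```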
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blogposts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are a few more parts to come still, hope you enjoy!
Watching the model solve these IMO problems and achieve gold-level performance was magical. A few thoughts 🧵
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
After studying the mathematics and computation of Sparsity for nearly 20 years, I have just realized that it is far more important than I ever appreciated. It truly serves as *the* model problem to understand deep networks and even intelligence to a large extent, from a…
And @keshigeyan is going to be presenting about Grafting - a great collaboration with @MichaelPoli6 on how to distill pretrained diffusion models into new architectures (Transformers -> Hyenas) 4/
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu Co-led with @MichaelPoli6
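The general recipe behind this kind of architecture editing, very roughly: swap one operator inside a pretrained block and regress the replacement onto the original's activations. A schematic sketch, not the paper's actual procedure (`block.attn`, the new operator, and the calibration loop are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def graft_block(block: nn.Module, new_op: nn.Module, calib_batches, steps=100):
    """Replace one operator in a pretrained block and distill it locally."""
    old_op = block.attn                          # placeholder attribute name
    opt = torch.optim.AdamW(new_op.parameters(), lr=1e-4)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = old_op(x)                   # teacher: the original operator's output
        loss = F.mse_loss(new_op(x), target)     # regress the replacement onto it
        opt.zero_grad()
        loss.backward()
        opt.step()
    block.attn = new_op                          # graft the trained replacement in
    return block
```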
If you’re staying for the #ICML2025 workshops, you should definitely go to @m_sirovatka’s talk today on the infra and design of @GPU_MODE’s OSS GPU leaderboard. He has a lot of interesting stuff to share :D
can't stop thinking about this one. insanely elegant, seems insanely powerful
Large Language Monkeys are scaling and they are hungry! 🍌 #ICML2025
I'll be at @icmlconf #ICML2025 next week to present three papers - reach out if you want to chat about generative AI, scaling laws, synthetic data or any other AI topic! #1 How Do Large Language Monkeys Get Their Power (Laws)? x.com/RylanSchaeffer…
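For reference, the coverage metric behind those power-law curves is the standard unbiased pass@k estimator averaged over problems. A small sketch (variable names are mine):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn from n
    attempts of which c were correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Coverage at k = mean pass@k over problems; plotted against k it often
# follows an approximate power law, which is what the paper studies.
```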