Simran Arora
@simran_s_arora
building ai systems, cs phd @stanfordailab @hazyresearch, incoming asst. prof. of cms @caltech
Wish writing AI kernels was like writing PyTorch??? Enter ThunderKittens 0.002: for simpler, faster, more adorable AI kernels! We use TK to provide 10-40% faster attention backwards, cuBLAS-speed GEMMs, 8x faster state space models, 14x faster linear attentions – averaging <200…

👑 We’re #1! Sonic-2 leads @Labelbox’s Speech Generation Leaderboard, coming out on top in speech quality, word error rate, and naturalness. Build your real-time voice apps with the 🥇 best voice AI model. ➡️ labelbox.com/leaderboards/s…
Join us at ES-FoMo tomorrow!! It's a great lineup!
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
Cartridges could be this "missing learning paradigm" Karpathy talks about
1) agent does tasks, collects memories that help it do better via ICL
2) memories are trained / compacted into Cartridges
3) Cartridges shared / composed / RAG-ed between other agents
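For intuition, here is a hedged toy sketch of that three-step loop. Every name and class below is hypothetical, invented for illustration; it is not the Cartridges codebase or API.

```python
# Hedged toy sketch of the three steps above; all names here are hypothetical,
# not the Cartridges codebase or API.
from dataclasses import dataclass, field

@dataclass
class Cartridge:
    """Stands in for a small trained KV cache distilled from an agent's memories."""
    source_docs: list
    params: dict = field(default_factory=dict)

def run_task_with_icl(task, memories):
    # 1) agent does tasks, stuffing raw memories into context (plain ICL)
    prompt = "\n".join(memories) + "\n" + task
    return f"answer to {task!r} using a {len(prompt)}-char prompt"

def compact_into_cartridge(memories):
    # 2) memories are trained / compacted offline into a Cartridge
    #    (the real method does this with the self-study training recipe)
    return Cartridge(source_docs=list(memories), params={"kv": "trained offline"})

def compose(cartridges):
    # 3) Cartridges shared / composed / RAG-ed between agents
    return Cartridge(source_docs=[d for c in cartridges for d in c.source_docs])

memories = ["notes from task A", "notes from task B"]
print(run_task_with_icl("a new task", memories))
mine = compact_into_cartridge(memories)
shared = compose([mine, Cartridge(source_docs=["another agent's notes"])])
print(len(shared.source_docs))  # 3: memory composed from two agents
```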
can't stop thinking about this one. insanely elegant, seems insanely powerful
thanks @willccbb!! check out Cartridges at ICML ES-FoMo this week :) excited for what's next
On Saturday we’re hosting the ES-FoMo workshop, with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others - we’ve got a great slate of papers and invited talks, come join us! (More on the speakers soon) x.com/esfomo/status/… 2/
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
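To make "dynamic chunking" concrete, here is a heavily simplified toy, not the H-Net architecture itself: a learned boundary predictor scores each byte, consecutive bytes are pooled into variable-length chunks, and downstream layers would operate on chunk states instead of tokenizer output. All module names and shapes below are made up for illustration.

```python
# Toy illustration of dynamic chunking (NOT the H-Net architecture): score each
# byte for "start a new chunk here?", then mean-pool bytes into chunk states.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)
        self.boundary = nn.Linear(d_model, 1)  # predicts chunk-boundary logits per byte

    def forward(self, byte_ids: torch.Tensor):
        h = self.byte_embed(byte_ids)                                  # (L, d)
        is_boundary = torch.sigmoid(self.boundary(h)).squeeze(-1) > 0.5
        is_boundary[0] = True                                          # first byte opens a chunk
        chunk_id = torch.cumsum(is_boundary.long(), dim=0) - 1         # which chunk each byte joins
        n_chunks = int(chunk_id.max().item()) + 1
        # mean-pool byte states into chunk states; downstream layers see chunks, not bytes
        chunks = torch.zeros(n_chunks, h.size(-1)).index_add_(0, chunk_id, h)
        counts = torch.zeros(n_chunks).index_add_(0, chunk_id, torch.ones_like(chunk_id, dtype=torch.float))
        return chunks / counts.unsqueeze(-1), chunk_id

text = "tokenization is the final barrier".encode()
chunker = DynamicChunker()
chunk_states, assignment = chunker(torch.tensor(list(text)))
print(chunk_states.shape, assignment.tolist())
```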
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
🤖 Household robots are becoming physically viable. But interacting with people in the home requires handling unseen, unconstrained, dynamic preferences, not just a complex physical domain. We introduce ROSETTA: a method to generate reward for such preferences cheaply. 🧵⬇️
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
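The general pattern here is verification-based selection: score each candidate generation with several cheap verifiers and keep the candidate with the best combined score. A minimal sketch under my own assumptions (toy verifiers and naive uniform weighting, not Weaver's actual implementation):

```python
# Hedged sketch of combining weak verifiers to select among candidate answers.
# The verifiers and weights below are placeholders, not Weaver's components.
import numpy as np

def select_answer(candidates, verifiers, weights=None):
    """candidates: list of answer strings; verifiers: list of functions mapping
    an answer -> score in [0, 1]; weights: optional per-verifier weights."""
    scores = np.array([[v(c) for v in verifiers] for c in candidates])  # (n_cand, n_verif)
    if weights is None:
        weights = np.ones(scores.shape[1]) / scores.shape[1]           # naive uniform combination
    combined = scores @ np.asarray(weights)
    return candidates[int(combined.argmax())], combined

# toy stand-ins for a reward model and an LM judge
length_prior = lambda ans: min(len(ans) / 100.0, 1.0)
keyword_judge = lambda ans: 1.0 if "42" in ans else 0.0

best, combined = select_answer(["the answer is 42", "not sure"],
                               [length_prior, keyword_judge])
print(best, combined)
```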
Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (@ICatGT / @GTrobotics / @mlatgt) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!
Struggling with context management? Wish you could just stick it all in your model? We’ve integrated Cartridges, a new method of leveraging sleep-time compute for learning long contexts, into Tokasaurus, an inference engine optimized for high-throughput 🧵
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
Claude not able to continue my research chat about context compression papers because it ran out of context because it doesn't use context compression.
Looks like a very slick way to tune and cheaply serve custom models! If I were building on this, I’d try to find a better way to initialize the cache. You can initialize LoRA as a no-op and let backprop handle the rest, but KV-tuning methods need weird initialization hacks.
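The LoRA point is easy to see in code: zero-initializing the B matrix makes the adapter an exact no-op at step 0, so training starts from the base model's behavior. A minimal PyTorch sketch of generic LoRA (not tied to any particular KV-tuning method):

```python
# Minimal LoRA sketch: with B initialized to zeros, the adapted layer matches
# the frozen base layer exactly before any training happens.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zeros -> no-op at init
    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

base = nn.Linear(16, 16)
lora = LoRALinear(base)
x = torch.randn(2, 16)
print(torch.allclose(lora(x), base(x)))  # True: identical to the base model before training
```

A trainable KV prefix has no equally clean "do nothing" initialization, which is why the tweet above calls the required hacks weird.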
When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x…
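A toy sketch of the idea under my own simplifying assumptions (a single attention call over random tensors; nothing like the released recipe, which trains the cache against the actual LM on synthetic conversations about the corpus): learn a much smaller set of key/value vectors offline so that attention over the small cache imitates attention over the full one.

```python
# Hedged toy sketch: distill a long frozen KV cache into a short trainable one.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, full_len, small_len = 64, 1024, 32          # small cache is 32x shorter

# frozen "document" cache we want to compress (stands in for the real corpus KV)
K_full = torch.randn(full_len, d)
V_full = torch.randn(full_len, d)

# trainable tiny cache (the "cartridge" analogue in this toy)
K_small = torch.nn.Parameter(torch.randn(small_len, d) * 0.02)
V_small = torch.nn.Parameter(torch.zeros(small_len, d))
opt = torch.optim.Adam([K_small, V_small], lr=1e-2)

for step in range(500):
    q = torch.randn(16, d)                     # synthetic "self-study" queries
    with torch.no_grad():                      # teacher: attend over the full cache
        target = F.scaled_dot_product_attention(q[None], K_full[None], V_full[None])[0]
    pred = F.scaled_dot_product_attention(q[None], K_small[None], V_small[None])[0]
    loss = F.mse_loss(pred, target)            # distill full-cache behavior into the tiny cache
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```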
Today we shipped a new real-time API for streaming speech-to-text (a new family of models called Ink) that’s extremely fast, cheap, and designed specifically for voice agents. We’re cooking hard, lots more releases coming soon 🧑‍🍳
Building voice agents? Meet Ink-Whisper: the fastest, most affordable streaming speech-to-text model.
🌎 Optimized for accuracy in real-world conditions
👯 Pair with our Sonic text-to-speech → fastest duo in voice AI
🔌 Plugs into @Vapi_AI, @pipecat_ai, @livekit
Read more:…
I like this idea very much and have long advocated for something like this. Synthetically enriched «KV prefix» is a natural augment to modern long context models.
Cartridges: Storing long contexts in tiny caches with self-study
- train-once, reusable memory via SELF-STUDY
- 38.6× less memory, 26.4× higher throughput
- extends context to 484k, composes across corpora
- outperforms LoRA, DuoAttention, and standard ICL
BLOG:…
Trading online compute for offline compute is an under-discussed axis of scaling, but one that will be increasingly relevant going forward.
Cartridges = an interesting offline alternative to regular ICL for frequently used large text corpora. 👇 A lot to learn in this awesome work imo. (another one from a Hazy Research team) Bravo to the team 👏
There’s been tons of work on KV-cache compression and on KV-cache-free Transformer alternatives (SSMs, linear attention) for long context, but we know there’s no free lunch with these methods. The quality-memory tradeoffs are annoying. *Is all lost?* Introducing CARTRIDGES:…
more evidence that kv caches have a lot of room for compression.