Rishabh Agarwal
@agarwl_
Reinforcement Learner @AIatMeta, Adjunct Prof at McGill. Ex DeepMind, Brain, Mila, IIT Bombay. NeurIPS Best Paper
I recently gave a tutorial on knowledge distillation for LLMs, explaining the mathematical derivations behind the commonly used methods. Sharing the slides here given the recent interest in this topic. drive.google.com/file/d/1xMohjQ…
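(The slides go through the derivations properly; purely as a minimal sketch of the most common setup — token-level distillation with a forward KL between teacher and student next-token distributions — here's roughly what that objective looks like in code. The temperature and tensor shapes are illustrative assumptions, not taken from the slides.)

```python
# Minimal sketch of token-level KD: forward KL(teacher || student) over next-token
# distributions. Placeholder temperature/shapes; not the exact losses from the slides.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """student_logits, teacher_logits: [batch, seq_len, vocab]."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(p_T || p_S) = sum_v p_T(v) * (log p_T(v) - log p_S(v)), per position
    kl = (t_logprobs.exp() * (t_logprobs - s_logprobs)).sum(dim=-1)
    return kl.mean()
```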

Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ tinyurl.com/rlshadis
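(For intuition on what "sharpening" means — my own toy numerical illustration, not the post's analysis: if you renormalize the base model's distribution over its own verified-correct samples, you can only boost strategies the base model already assigns mass to; a strategy with zero base probability stays at zero.)

```python
# Toy illustration of "sharpening": renormalizing the base model's distribution
# over its own correct samples never surfaces a strategy with zero base probability.
import numpy as np

base_probs = np.array([0.60, 0.25, 0.10, 0.05, 0.00])  # last entry: a genuinely novel strategy
correct    = np.array([0,    1,    1,    0,    1   ])   # verifier labels per strategy

sharpened = base_probs * correct
sharpened /= sharpened.sum()
print(sharpened)  # [0. 0.714 0.286 0. 0.] -- index 4 (the novel strategy) stays at 0
```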
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
I wrote up this post about how we should **unify RL and next-token-prediction**, based on my perspective on how humans learn new languages. Then I realized @jxmnop wrote the exact same thing about how we should scale RL to 10^26 FLOPs
Kimi K2 is here! The first big beautiful model purpose-built for agentic capabilities is now open-source! Agent RL, ready for takeoff!
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹 Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
The age of transformers is ending...the dawn of linear-cost architectures is upon us. Power Attention replaces Flash Attention in any transformer, and removes the quadratic penalty of context scaling while achieving strong performance. The result: domination of both transformers…
Releasing Power Attention: manifestai.com/articles/relea…
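(Power Attention's actual formulation is in the linked article; as a generic point of comparison only — not their method — this is the standard kernelized linear-attention trick that avoids materializing the N×N score matrix by reassociating the matmuls, so cost grows linearly in sequence length.)

```python
# Generic (non-causal) linear-attention sketch, NOT Power Attention itself.
# With a positive feature map phi, softmax(QK^T)V is approximated by
# phi(Q) @ (phi(K)^T V), so the N x N score matrix is never formed:
# cost is O(N * d^2) instead of O(N^2 * d).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: [batch, heads, seq_len, head_dim]."""
    phi = lambda x: F.elu(x) + 1.0                      # simple positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)          # summary of K/V, linear in n
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```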
youtu.be/E22AOHAEtu4?si… Great talk. Thanks @shuchaobi for delivering it, and @CUSEAS for uploading it.
Excited to share what I worked on during my time at Meta. - We introduce a Triton-accelerated Transformer with *2-simplicial attention*—a tri-linear generalization of dot-product attention - We show how to adapt RoPE to tri-linear forms - We show 2-simplicial attention scales…
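(To make the "tri-linear" part concrete, here's a naive O(n³) reference for the score: each query attends to *pairs* of positions via two key projections. This is just the einsum definition — not the Triton kernel, not the RoPE adaptation, and the value aggregation below is one plausible choice, not necessarily the paper's.)

```python
# Naive reference for a tri-linear (2-simplicial) attention score:
# s[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d], softmaxed over pairs (j, k).
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    """All inputs: [seq_len, head_dim]; returns [seq_len, head_dim]."""
    scores = torch.einsum("id,jd,kd->ijk", q, k1, k2) / q.shape[-1] ** 0.5
    n = scores.shape[0]
    attn = torch.softmax(scores.reshape(n, -1), dim=-1).reshape(n, n, n)
    # One plausible aggregation over the attended pair (j, k): elementwise
    # product of the two value vectors. The paper's exact choice may differ.
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)
```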
Join my team at @genesistxai ! 🧬 We're forging AI foundation models to unlock groundbreaking therapies for patients with severe diseases. We're hiring ML Scientists, Engineers, TPMs & Interns in foundation models, #LLMs , #RL, #diffusion models, and other cutting-edge areas of…
Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning. - World’s longest context window: 1M-token input, 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency:…
Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside base LLM 💡? We answer these ⤵️ 🚨 arxiv.org/abs/2506.09026 🚨 matthewyryang.github.io/e3/
👉 New preprint on a new family of Transformer-type models whose depth scales logarithmically with sequence length. Enables: - fast training - fast decoding - large memory capacity in associative recall - strong length generalization on state tracking
Transformers: ⚡️fast to train (compute-bound), 🐌slow to decode (memory-bound). Can Transformers be optimal in both? Yes! By exploiting sequential-parallel duality. We introduce Transformer-PSM with constant time per token decode. 🧐 arxiv.org/pdf/2506.10918
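(A rough way to see the compute-bound vs memory-bound framing — back-of-the-envelope only, not their analysis, with made-up model sizes: each decoded token in a standard transformer has to re-read the entire KV cache, so memory traffic grows with position, while a constant-size decoding state reads the same amount every step.)

```python
# Back-of-the-envelope memory traffic per decoded token (illustrative numbers only).
# Standard attention re-reads the whole KV cache each step (grows with position t);
# a constant-size recurrent/prefix-scan state reads the same bytes every step.
def kv_cache_bytes_read(t, layers=32, heads=32, head_dim=128, bytes_per=2):
    return t * layers * heads * head_dim * 2 * bytes_per  # K and V for t positions

def constant_state_bytes_read(state_bytes=64 * 1024 * 1024):  # arbitrary placeholder size
    return state_bytes                                        # independent of position t

for t in (1_000, 10_000, 100_000):
    print(f"t={t}: KV read ~{kv_cache_bytes_read(t)/1e9:.1f} GB "
          f"vs constant-state ~{constant_state_bytes_read()/1e9:.2f} GB")
```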
Last day today @AIatMeta, reflecting on the last several months, and wanted to highlight a few things I enjoyed working on: Building new algorithms for on-policy distillation with @DatHuynh13 Science of end-to-end thinking models @agarwl_ and many others Working prototype of…
Learning to play Atari from pixels from scratch in 30 minutes, all locally on an Apple Watch!
Slides here for my CVPR talk: drive.google.com/file/d/1xd9gPM… @anoopcherian will probably know about the recording

A few more observations after replicating the Tower of Hanoi game with their exact prompts: - You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff. - Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and…
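(Using the tweet's own estimates — 2^N - 1 moves, ~10 output tokens per move, the stated output limits — a quick check of when the move list alone blows the budget:)

```python
# Quick check with the tweet's estimates: optimal Tower of Hanoi needs 2^N - 1 moves
# at roughly 10 output tokens per move (ignoring the constant overhead), so the full
# move list alone exceeds the stated output limits well before reasoning is the issue.
LIMITS = {"Sonnet 3.7": 128_000, "DeepSeek R1": 64_000}
TOKENS_PER_MOVE = 10

for n in range(10, 16):
    needed = (2**n - 1) * TOKENS_PER_MOVE
    over = [name for name, lim in LIMITS.items() if needed > lim]
    print(f"N={n}: ~{needed:,} tokens; exceeds: {over or 'none'}")
```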
Apple just GaryMarcus'd LLM reasoning ability
🥳 Happy to share our new work – Kinetics: Rethinking Test-Time Scaling Laws 🤔How to effectively build a powerful reasoning agent? Existing compute-optimal scaling laws suggest 64K thinking tokens + 1.7B model > 32B model. But, It only shows half of the picture! 🚨 The O(N²)…
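(A rough version of the cost comparison the thread points at — my back-of-the-envelope, not the paper's cost model, with placeholder d_model/layer counts and an assumed short budget for the 32B model: the dense FLOPs per token scale with parameter count, but attention over a long chain of thought adds a term that grows with generated length, so 64K thinking tokens from a small model is not free.)

```python
# Back-of-the-envelope generation cost (NOT the paper's model): dense FLOPs per token
# ~ 2 * n_params, plus an attention term growing with tokens already generated
# -- the O(N^2) piece the thread mentions. Model dims below are placeholders.
def gen_cost(n_params, n_tokens, d_model, n_layers):
    dense = 2 * n_params * n_tokens
    attention = sum(4 * n_layers * d_model * t for t in range(n_tokens))  # ~O(N^2)
    return dense + attention

small_long = gen_cost(n_params=1.7e9, n_tokens=64_000, d_model=2048, n_layers=28)
big_short  = gen_cost(n_params=32e9,  n_tokens=4_000,  d_model=5120, n_layers=64)
print(f"small+long: {small_long:.2e} FLOPs, big+short: {big_short:.2e} FLOPs")
```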
Good take -- it's a good benchmark to develop better training algorithms / inference time scaling, which you can validate on other domains. Random / incorrect rewards won't work on this one. Main gotcha is to not overfit to just ARC-like puzzles.
people stopped working on ARC-AGI because they realized it was too hard
Giving my first ever invited talk at @CVPR , during the multimodal reasoning workshop: The Bitter Lesson for RL: Verification as the Key to Reasoning LLMs This talk is inspired by the two classic essays from Rich Sutton:


a great video by @jbhuang0604 explaining KL divergence and its computation
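(If you want to play with the computation yourself — my own snippet, not from the video: the standard Monte Carlo estimators of KL(q‖p) from samples of q, including the low-variance k3 estimator that shows up in RLHF codebases.)

```python
# Monte Carlo estimators of KL(q || p) from samples x ~ q, given r = log p(x) - log q(x).
# k1 = -r is unbiased but noisy; k3 = (exp(r) - 1) - r is low-variance and non-negative.
import numpy as np

rng = np.random.default_rng(0)
q_mean, p_mean = 0.0, 0.5
x = rng.normal(q_mean, 1.0, size=100_000)                # samples from q = N(0, 1)
r = -0.5 * (x - p_mean) ** 2 + 0.5 * (x - q_mean) ** 2   # log p(x) - log q(x), p = N(0.5, 1)

k1 = np.mean(-r)                     # -E_q[log p/q]
k3 = np.mean(np.expm1(r) - r)        # E_q[(p/q - 1) - log(p/q)]
true_kl = 0.5 * (p_mean - q_mean) ** 2   # closed form for unit-variance Gaussians
print(k1, k3, true_kl)               # all approximately 0.125
```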
So many works talking about entropy, but what is the **mechanism** of entropy in RL for LLMs? 🤔 Our work gives a principled understanding, as well as two tricks that get entropy **controlled** 🧵
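(For reference — a generic snippet, not the paper's mechanism or its two tricks: the token-level policy entropy that RL-for-LLMs runs typically log, i.e. the quantity whose collapse or blow-up these analyses are about.)

```python
# Generic token-level policy entropy as typically logged during RL training of LLMs:
# H = -sum_v p(v) log p(v), averaged over non-padding positions. Monitoring only;
# this is not the paper's analysis or its entropy-control tricks.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, mask):
    """logits: [batch, seq_len, vocab]; mask: [batch, seq_len], 1 for real tokens."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # [batch, seq_len]
    return (entropy * mask).sum() / mask.sum()
```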