Shizhe Diao
@shizhediao
Research Scientist @NVIDIA focusing on efficient post-training of LLMs. Fine-tune your own LLMs with LMFlow: http://go.uic.edu/shizhe Views are my own.
Does RL truly expand a model’s reasoning🧠capabilities? Contrary to recent claims, the answer is yes—if you push RL training long enough! Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model💥and offering…

Pass@1024 results of our RL model (AceReason-Nemotron-7B) and its starting SFT model (DeepSeek-R1-Distill-Qwen-7B) on LiveCodeBench-v6, which features a large answer space and high-quality test cases that are difficult to solve through 'guessing', even with extensive sampling.…
Introducing AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning (RL) We propose conducting RL on math-only prompts first, then on code-only prompts. Our key findings include: - Math-only RL significantly boosts both math and code benchmarks! -…
Empowering the model to independently manage the full task is fundamental to unlocking its full potential.
Heard from a little bird that GPT-5 is imminent. - It’s not one model, but multiple models. It has a router that switches between reasoning, non-reasoning, and tool-using models. - That’s why Sam said they’d “fix model naming”: prompts will just auto-route to the right model. -…
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code LLM to date.
Huge leap for open-source theorem proving. Goedel-Prover V2 matches 671B models with just 8B...
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
🚨 NVIDIA is launching the Data Filtering Challenge for training edge language models! We believe edge LMs are the future — lightweight, powerful, and ready for real-world tasks like: 🧠 Reasoning 🗣️ Roleplay 🔍 RAG 🔧 Function calling Time to push dataset filtering to the…

Amazing
It finally happened 😭 After 8 months of hard work, the OpenHands agent surpassed the last human developer on our repository, @xingyaow_. Fellow humans, we had a good run.
Compressing LLMs but worried about accuracy? 🎯 New from #NVIDIAResearch: EoRA uses eigenspace low-rank approximation to compensate for errors—no retraining needed. A promising direction for scalable, task-adaptive LLMs. 🔗 nvda.ws/448xkvZ
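EoRA's actual method works in an eigenspace projection; as a rough illustration of the general idea only (low-rank compensation of compression error, not NVIDIA's implementation), here is a plain SVD-based sketch:

```python
import numpy as np

def low_rank_error_compensation(w, w_compressed, rank):
    """Approximate the compression residual (w - w_compressed) with a
    rank-`rank` correction, so that w_compressed + a @ b ~= w."""
    residual = w - w_compressed
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # left factor, scaled by singular values
    b = vt[:rank, :]             # right factor
    return a, b

# Toy example: "compress" a weight matrix by coarse 0.5-step rounding.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_q = np.round(w * 2) / 2
a, b = low_rank_error_compensation(w, w_q, rank=16)
err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + a @ b))
```

The appeal of this family of methods is exactly what the tweet highlights: the correction is computed in closed form from the residual, so no retraining is needed.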
Impressive work! The parallel generation capability of Multiverse looks amazing!
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
🚀 I'm looking for full-time research scientist jobs on foundation models! I study pre-training and post-training of foundation models, and LLM-based coding agents. The figure highlights my research/publications. Please DM me if there is any good fit! Highly appreciated!
Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
It has not saturated yet. At NVIDIA, we present "prolonged RL", where we significantly scale up RL training steps (>2k) and problems (>130k). The improvement from RL scaling is surprising and exciting. The RL-ed model makes great progress on some problems that the base model…
And this on a 1.5B model :), 136k problems. RL scaling makes us happy.
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
What happens when you ✨scale up RL✨? In our new work, Prolonged RL, we significantly scale RL training to >2k steps and >130k problems—and observe exciting, non-saturating gains as we spend more compute 🚀.
RL scaling is here arxiv.org/pdf/2505.24864
Super excited to share 💪🧠Reasoning Gym! 🧵 We provide over 100 data generators and verifiers spanning several domains (algebra, arithmetic, code, geometry, logic, games) for training the next generation of reasoning models. In essence, we can generate an infinite amount of…
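The generator/verifier pattern Reasoning Gym is built around can be sketched as follows (hypothetical arithmetic task with made-up function names, not the project's actual API):

```python
import random

def generate_arithmetic_task(rng):
    """Generate one addition problem plus the ground truth needed to verify it.

    Because tasks are produced procedurally from a seeded RNG, one generator
    yields an effectively unlimited stream of fresh training problems.
    """
    a, b = rng.randint(0, 999), rng.randint(0, 999)
    return {"question": f"What is {a} + {b}?", "answer": str(a + b)}

def verify(task, model_output):
    """Programmatic verifier: exact match against the known answer."""
    return model_output.strip() == task["answer"]

rng = random.Random(42)
task = generate_arithmetic_task(rng)
ok = verify(task, task["answer"])      # ground truth always verifies
bad = verify(task, "definitely wrong")
```

Pairing cheap procedural generation with a deterministic verifier is what makes such data usable as an RL reward signal without any human labeling.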
Ah, a very timely paper that validates my current intuition: RL should be scaled beyond 1k steps, and for this you need to simultaneously scale the group size (here up to 256) and expand the search space.
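The "group size" knob refers to sampling many rollouts per prompt and normalizing rewards within that group, GRPO-style. A minimal sketch of that advantage computation (illustrative only, not the paper's code):

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each rollout's reward against the
    other rollouts sampled for the same prompt (the 'group')."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, a group of 8 rollouts: 1.0 = verifier passed, 0.0 = failed.
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_normalized_advantages(group)
```

A larger group gives a lower-variance baseline per prompt, which is one reason scaling group size helps when pushing RL training this long.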
NVIDIA presents ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models