Amirhossein Kazemnejad @ ICML
@a_kazemnejad
Working on RL training of LLMs @Mila_Quebec. Prev: @mcgillu
VinePPO, a straightforward modification to PPO, unlocks RL's true potential for LLM reasoning. It beats RL-free methods (DPO and RestEM) and PPO, surpassing it in fewer steps (up to 9x), less time (up to 3x), and less KL, with half the memory. Time to rethink RL post-training🧵: [1/n]
![Tweet image](https://pbs.twimg.com/media/GY-weC3WgAAK4pI.jpg)
![Tweet image](https://pbs.twimg.com/media/GY-whxyWYAEepaf.jpg)
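For context, a minimal sketch of the core change as I read it: instead of a learned value network, the value of each intermediate reasoning step is estimated with Monte Carlo rollouts from that step. The `generate_fn` and `grade_fn` helpers below are placeholders, not the actual codebase.

```python
import statistics

def mc_value(generate_fn, grade_fn, prompt, partial_solution, num_rollouts=4):
    """Estimate V(state) with Monte Carlo rollouts: finish the partial solution
    several times and average the terminal rewards. `generate_fn` samples a
    continuation from the current policy; `grade_fn` returns a 0/1 correctness
    reward. Both are placeholder callables."""
    returns = [
        grade_fn(prompt, partial_solution + generate_fn(prompt + partial_solution))
        for _ in range(num_rollouts)
    ]
    return statistics.mean(returns)

def step_advantages(generate_fn, grade_fn, prompt, steps):
    """Per-step advantage = V(after the step) - V(before it), replacing PPO's
    learned value head with rollout-based estimates."""
    values = [
        mc_value(generate_fn, grade_fn, prompt, "".join(steps[:i]))
        for i in range(len(steps) + 1)
    ]
    return [values[i + 1] - values[i] for i in range(len(steps))]
```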
AgentRewardBench will be presented at @COLM_conf 2025 in Montreal! See you soon and ping me if you want to meet up!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
📄 New Paper Alert! ✨ 🚀Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput…
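A rough illustration of the recursion idea (my paraphrase, not the paper's implementation): one shared block is re-applied several times, and a small router decides per token whether to keep recursing or exit early.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Illustrative sketch only: a single shared transformer layer is reused
    up to `max_rec` times, with a tiny per-token router softly gating whether
    each token takes another recursion step."""
    def __init__(self, d_model=256, n_heads=4, max_rec=3):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)   # per-token "keep recursing" score
        self.max_rec = max_rec

    def forward(self, x):
        for _ in range(self.max_rec):
            keep = torch.sigmoid(self.router(x))        # (B, T, 1) continue-probability
            x = keep * self.shared(x) + (1 - keep) * x  # soft early exit per token
        return x

h = RecursiveBlock()(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```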
I'll be at #ICML2025 this week presenting SafeArena (Wednesday 11AM - 1:30PM in East Exhibition Hall E-701). Come by to chat with me about web agent safety (or anything else safety-related)!
Thanks @_akhaliq for sharing our work! Excited to present our next generation of SVG models, now using Reinforcement Learning from Rendering Feedback (RLRF). 🧠 We think we cracked SVG generalization with this one. Go read the paper! arxiv.org/abs/2505.20793 More details on…
Rendering-Aware Reinforcement Learning for Vector Graphics Generation RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization
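To make the rendering-feedback idea concrete, here is a toy reward along those lines (my sketch; the paper's actual reward and tooling may differ): rasterize the generated SVG and score it against the target image.

```python
import io
import numpy as np
import cairosvg            # assumed available for rasterization
from PIL import Image

def render_reward(svg_code: str, target: np.ndarray) -> float:
    """Toy rendering-feedback reward (not the paper's exact metric): rasterize
    the generated SVG and score pixel similarity to `target`, a grayscale
    array in [0, 1]. Unrenderable SVG gets the lowest reward."""
    try:
        png = cairosvg.svg2png(bytestring=svg_code.encode(),
                               output_width=target.shape[1],
                               output_height=target.shape[0])
    except Exception:
        return -1.0                                  # unparseable / unrenderable SVG
    img = np.asarray(Image.open(io.BytesIO(png)).convert("L"), dtype=np.float32) / 255.0
    return 1.0 - float(np.abs(img - target).mean())  # 1 = perfect match
```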
Deliberate practice is accepted to #ICML2025 as a spotlight (top 2.6%!) 🚀
🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP)—a dynamic framework that focuses on where the model struggles most to generate useful synthetic training examples. 🔥…
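One way to picture the "focus where the model struggles" loop (illustrative only; the paper's exact selection criterion may differ): score a pool of synthetic candidates with the current learner and keep the examples it is least sure about.

```python
import torch

def select_hard_examples(learner, candidates, labels, k):
    """Illustrative selection step: rank a pool of synthetic candidates by the
    learner's predictive entropy and keep the k it struggles with most."""
    learner.eval()
    with torch.no_grad():
        probs = torch.softmax(learner(candidates), dim=-1)          # (N, C)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)    # (N,)
    top = entropy.topk(k).indices
    return candidates[top], labels[top]
```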
I'm presenting our recent work on "Pitfalls of Memorization" today at ICLR, #304, at 3pm. Come say hi! iclr.cc/virtual/2025/p…
New Paper! 📄 Once a model memorizes an example, it stops learning from it! Our latest work explores this phenomenon and the nuanced interplay between memorization and generalization. Let’s dive in! 🚀🧵
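A toy diagnostic in that spirit (hypothetical helper, not the paper's code): track per-example loss during training, since an example with near-zero loss contributes essentially no gradient and has effectively stopped teaching the model.

```python
import torch
import torch.nn.functional as F

def per_example_stats(model, x, y):
    """Per-example losses plus a crude 'memorized' flag: once the loss is ~0,
    the example's gradient contribution is ~0 as well."""
    logits = model(x)
    losses = F.cross_entropy(logits, y, reduction="none")   # one loss per example
    memorized = losses < 1e-3                                # ~zero loss => ~zero learning signal
    return losses.detach(), memorized
```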
It's deeply concerning that one of the best AI researchers I've worked with, @kaicathyc, was denied a U.S. green card today. A Canadian who's lived and contributed here for 12 years now has to leave. We’re risking America’s AI leadership when we turn away talent like this.
Soon we'll be training LLM agents end-to-end in collaborative environments, and our current PPO/GRPO-based algos simply won't work. @MAghajohari & collabs have been working on scalable RL algos for this future for a while. Advantage Alignment is the new highly simplified version.
Multi-Agent RL fails in real life. Agents cooperating to solve tasks remains a utopia.
- No scalable algorithms for general-sum games.
- In a simple apple-harvesting game, PPO agents overharvest and ruin bushes.
Advantage Alignment (ICLR 2025 Oral📢) is a huge step forward. 1/n
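A deliberately loose caricature of the idea, not the paper's estimator: weight each action not only by the agent's own advantage but also by a term that couples both agents' advantages, so actions that benefit both get reinforced.

```python
import torch

def aligned_pg_loss(logp1, adv1, adv2, beta=1.0):
    """Illustrative policy-gradient surrogate for agent 1: its own advantage
    plus a coupling term between the two agents' advantages nudges the policy
    toward mutually beneficial behaviour in general-sum games."""
    weight = adv1 + beta * adv1 * adv2        # alignment term couples the advantages
    return -(logp1 * weight.detach()).mean()  # standard REINFORCE-style loss
```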
AgentRewardBench Evaluating Automatic Evaluations of Web Agent Trajectories
A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.
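To illustrate the gap the benchmark measures, here are two hypothetical stand-in evaluators (not the benchmark's code): a brittle rule-based check versus an LLM judge over the same trajectory.

```python
def rule_based_success(trajectory, must_contain):
    """Illustrative rule-based evaluator: declare success only if the final page
    contains an exact expected string, which is why such checks tend to
    under-report genuinely successful trajectories."""
    return must_contain in trajectory[-1]["page_text"]

def llm_judge_success(trajectory, task, ask_llm):
    """Illustrative LLM-judge evaluator: show the task and the action/observation
    history to a judge model (`ask_llm` is a placeholder callable) and parse a
    yes/no verdict. The benchmark scores such judges against expert labels."""
    history = "\n".join(f"{t['action']} -> {t['page_text'][:200]}" for t in trajectory)
    verdict = ask_llm(f"Task: {task}\nTrajectory:\n{history}\nDid the agent succeed? yes/no")
    return verdict.strip().lower().startswith("yes")
```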
DeepSeek-R1 Thoughtology now #2 on @huggingface daily papers Thanks for building this great platform for sharing new papers @_akhaliq
DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning 142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
I think one of the most underrated sources of insight in research is just looking at the model's outputs. The Thoughtology paper is what happens when an entire lab of grad students at Mila does this cumbersome task for R1's CoT and actually quantifies all the patterns we saw.
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour. 🔗: mcgill-nlp.github.io/thoughtology/
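In that spirit, even a crude script over the <think> blocks surfaces patterns worth quantifying (the markers below are my own picks, not the paper's taxonomy):

```python
import re

def chain_stats(response: str):
    """Toy version of 'just look at the outputs': pull out the <think> block
    and count simple signals such as chain length and how often the model
    second-guesses itself."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.S)
    chain = m.group(1) if m else ""
    rethinks = len(re.findall(r"\b(wait|alternatively|let me reconsider)\b", chain, flags=re.I))
    return {"thought_words": len(chain.split()), "rethink_markers": rethinks}
```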
Happy to share that my Google DeepMind internship project is finally out!
We're very excited to introduce TAPNext: a model that sets a new state of the art for Tracking Any Point in videos, by formulating the task as Next Token Prediction. For more, see: tap-next.github.io 🧵
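A minimal sketch of the framing, not TAPNext's actual tokenizer: quantize point coordinates onto a grid so a track becomes a token sequence an autoregressive model can continue.

```python
def coords_to_tokens(points, grid=256):
    """Illustrative coordinate tokenizer: each (x, y) point, normalized to
    [0, 1], is snapped onto a grid and mapped to one discrete token, so
    predicting the next point position becomes next-token prediction."""
    tokens = []
    for x, y in points:
        xi = min(int(x * grid), grid - 1)
        yi = min(int(y * grid), grid - 1)
        tokens.append(yi * grid + xi)
    return tokens

# e.g. coords_to_tokens([(0.10, 0.20), (0.12, 0.21)]) -> a two-token track to continue
```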
Llama 4 uses async RLHF and I would like to announce that I called it arxiv.org/abs/2410.18252
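The gist of asynchronous RLHF, as a schematic (my sketch of the setup, not the paper's system): generation and training overlap instead of alternating, at the price of slightly off-policy batches.

```python
import queue
import threading

def async_rl_loop(generate_batch, train_step, total_steps, buffer_size=4):
    """Schematic async loop: a generator thread keeps producing rollouts with
    the latest-available weights while the learner trains on slightly stale
    batches, so neither side waits on the other."""
    rollouts = queue.Queue(maxsize=buffer_size)

    def generator():
        while True:
            rollouts.put(generate_batch())   # samples from a possibly stale policy

    threading.Thread(target=generator, daemon=True).start()
    for _ in range(total_steps):
        train_step(rollouts.get())           # consume rollouts as they arrive
```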
I wish this existed when I started working on RL for LLMs, so I created it. Other codebases are industry-first: complex, built on Ray, hard to hack, multi-node oriented... This is the best RL-for-LLM codebase for academia, and it comes with a 5h implementation video starting from an empty notebook.
Introducing nanoAhaMoment: Karpathy-style, single-file RL for LLM library (<700 lines)
- super hackable
- no TRL / Verl, no abstraction💆♂️
- single GPU, full param tuning, 3B LLM
- efficient (R1-zero countdown < 10h)
Comes with a from-scratch, fully spelled out YT video [1/n]
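For flavor, here is a single training step in the style such a loop spells out (my paraphrase, not the repo's code): sample a group of completions, score them with a rule-based reward, and take a REINFORCE-style step on group-normalized advantages. Prompt and padding masking are omitted for brevity.

```python
import torch

def rl_step(model, tokenizer, prompt, reward_fn, optimizer, k=8):
    """One sketched RL step for a HF causal LM: sample k completions for the
    prompt, score them with a rule-based `reward_fn` on the decoded text, and
    update the policy with group-normalized rewards as advantages."""
    enc = tokenizer(prompt, return_tensors="pt")
    outs = model.generate(**enc, do_sample=True, num_return_sequences=k,
                          max_new_tokens=128, return_dict_in_generate=True)
    seqs = outs.sequences                                       # (k, prompt+completion)
    rewards = torch.tensor(
        [reward_fn(tokenizer.decode(s, skip_special_tokens=True)) for s in seqs],
        dtype=torch.float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-normalized advantage
    logits = model(seqs).logits[:, :-1]
    logps = torch.log_softmax(logits, -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
    loss = -(adv[:, None] * logps).mean()                       # REINFORCE-style surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```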