Amirhossein Kazemnejad @ ICML
@a_kazemnejad
Working on RL training of LLMs @Mila_Quebec. Prev: @mcgillu
VinePPO, a straightforward modification to PPO, unlocks RL's true potential for LLM reasoning. It beats RL-free methods (DPO and RestEM) and PPO, surpassing it in fewer steps (up to 9x), less time (up to 3x), and less KL, with half the memory. Time to rethink RL post-training🧵: [1/n]
![Tweet image](https://pbs.twimg.com/media/GY-weC3WgAAK4pI.jpg)
![Tweet image](https://pbs.twimg.com/media/GY-whxyWYAEepaf.jpg)
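For context, a minimal sketch of the core change as I read it: instead of a learned value network, the value of each intermediate reasoning step is estimated with Monte Carlo rollouts from that step. The `generate_fn` and `grade_fn` helpers below are placeholders, not the actual codebase.

```python
import statistics

def mc_value(generate_fn, grade_fn, prompt, partial_solution, num_rollouts=4):
    """Estimate V(state) with Monte Carlo rollouts: finish the partial solution
    several times and average the terminal rewards. `generate_fn` samples a
    continuation from the current policy; `grade_fn` returns a 0/1 correctness
    reward. Both are placeholder callables."""
    returns = [
        grade_fn(prompt, partial_solution + generate_fn(prompt + partial_solution))
        for _ in range(num_rollouts)
    ]
    return statistics.mean(returns)

def step_advantages(generate_fn, grade_fn, prompt, steps):
    """Per-step advantage = V(after the step) - V(before it), replacing PPO's
    learned value head with rollout-based estimates."""
    values = [
        mc_value(generate_fn, grade_fn, prompt, "".join(steps[:i]))
        for i in range(len(steps) + 1)
    ]
    return [values[i + 1] - values[i] for i in range(len(steps))]
```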
AgentRewardBench will be presented at @COLM_conf 2025 in Montreal! See you soon and ping me if you want to meet up!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
📄 New Paper Alert! ✨ 🚀Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput…
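A rough illustration of the recursion idea (my paraphrase, not the paper's implementation): one shared block is re-applied several times, and a small router decides per token whether to keep recursing or exit early.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Illustrative sketch only: a single shared transformer layer is reused
    up to `max_rec` times, with a tiny per-token router softly gating whether
    each token takes another recursion step."""
    def __init__(self, d_model=256, n_heads=4, max_rec=3):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)   # per-token "keep recursing" score
        self.max_rec = max_rec

    def forward(self, x):
        for _ in range(self.max_rec):
            keep = torch.sigmoid(self.router(x))        # (B, T, 1) continue-probability
            x = keep * self.shared(x) + (1 - keep) * x  # soft early exit per token
        return x

h = RecursiveBlock()(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```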
I'll be at #ICML2025 this week presenting SafeArena (Wednesday 11AM - 1:30PM in East Exhibition Hall E-701). Come by to chat with me about web agent safety (or anything else safety-related)!
Thanks @_akhaliq for sharing our work! Excited to present our next generation of SVG models, now using Reinforcement Learning from Rendering Feedback (RLRF). 🧠 We think we cracked SVG generalization with this one. Go read the paper! arxiv.org/abs/2505.20793 More details on…
Rendering-Aware Reinforcement Learning for Vector Graphics Generation RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization
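To make the rendering-feedback idea concrete, here is a toy reward along those lines (my sketch; the paper's actual reward and tooling may differ): rasterize the generated SVG and score it against the target image.

```python
import io
import numpy as np
import cairosvg            # assumed available for rasterization
from PIL import Image

def render_reward(svg_code: str, target: np.ndarray) -> float:
    """Toy rendering-feedback reward (not the paper's exact metric): rasterize
    the generated SVG and score pixel similarity to `target`, a grayscale
    array in [0, 1]. Unrenderable SVG gets the lowest reward."""
    try:
        png = cairosvg.svg2png(bytestring=svg_code.encode(),
                               output_width=target.shape[1],
                               output_height=target.shape[0])
    except Exception:
        return -1.0                                  # unparseable / unrenderable SVG
    img = np.asarray(Image.open(io.BytesIO(png)).convert("L"), dtype=np.float32) / 255.0
    return 1.0 - float(np.abs(img - target).mean())  # 1 = perfect match
```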
Deliberate practice is accepted to #ICML2025 as a spotlight (top 2.6%!) 🚀
🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP)—a dynamic framework that focuses on where the model struggles most to generate useful synthetic training examples. 🔥…
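One way to picture the "focus where the model struggles" loop (illustrative only; the paper's exact selection criterion may differ): score a pool of synthetic candidates with the current learner and keep the examples it is least sure about.

```python
import torch

def select_hard_examples(learner, candidates, labels, k):
    """Illustrative selection step: rank a pool of synthetic candidates by the
    learner's predictive entropy and keep the k it struggles with most."""
    learner.eval()
    with torch.no_grad():
        probs = torch.softmax(learner(candidates), dim=-1)          # (N, C)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)    # (N,)
    top = entropy.topk(k).indices
    return candidates[top], labels[top]
```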
I'm presenting our recent work on "Pitfalls of Memorization" today at ICLR, #304, at 3pm. Come say hi! iclr.cc/virtual/2025/p…
New Paper! 📄 Once a model memorizes an example, it stops learning from it! Our latest work explores this phenomenon and the nuanced interplay between memorization and generalization. Let’s dive in! 🚀🧵
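A toy diagnostic in that spirit (hypothetical helper, not the paper's code): track per-example loss during training, since an example with near-zero loss contributes essentially no gradient and has effectively stopped teaching the model.

```python
import torch
import torch.nn.functional as F

def per_example_stats(model, x, y):
    """Per-example losses plus a crude 'memorized' flag: once the loss is ~0,
    the example's gradient contribution is ~0 as well."""
    logits = model(x)
    losses = F.cross_entropy(logits, y, reduction="none")   # one loss per example
    memorized = losses < 1e-3                                # ~zero loss => ~zero learning signal
    return losses.detach(), memorized
```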
It's deeply concerning that one of the best AI researchers I've worked with, @kaicathyc, was denied a U.S. green card today. A Canadian who's lived and contributed here for 12 years now has to leave. We’re risking America’s AI leadership when we turn away talent like this.
Soon we'll be training LLM agents end-to-end in collaborative environments, and our current PPO/GRPO-based algos simply won't work. @MAghajohari & collabs have been working on scalable RL algos for this future for a while. Advantage Alignment is the new highly simplified version.
Multi-Agent RL fails in real life. Agents cooperating to solve tasks remains a utopia.
- No scalable algorithms for general-sum games.
- In a simple apple-harvesting game, PPO agents overharvest and ruin bushes.
Advantage Alignment (ICLR 2025 Oral📢) is a huge step forward. 1/n
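A deliberately loose caricature of the idea, not the paper's estimator: weight each action not only by the agent's own advantage but also by a term that couples both agents' advantages, so actions that benefit both get reinforced.

```python
import torch

def aligned_pg_loss(logp1, adv1, adv2, beta=1.0):
    """Illustrative policy-gradient surrogate for agent 1: its own advantage
    plus a coupling term between the two agents' advantages nudges the policy
    toward mutually beneficial behaviour in general-sum games."""
    weight = adv1 + beta * adv1 * adv2        # alignment term couples the advantages
    return -(logp1 * weight.detach()).mean()  # standard REINFORCE-style loss
```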
AgentRewardBench Evaluating Automatic Evaluations of Web Agent Trajectories
A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.
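To illustrate the gap the benchmark measures, here are two hypothetical stand-in evaluators (not the benchmark's code): a brittle rule-based check versus an LLM judge over the same trajectory.

```python
def rule_based_success(trajectory, must_contain):
    """Illustrative rule-based evaluator: declare success only if the final page
    contains an exact expected string, which is why such checks tend to
    under-report genuinely successful trajectories."""
    return must_contain in trajectory[-1]["page_text"]

def llm_judge_success(trajectory, task, ask_llm):
    """Illustrative LLM-judge evaluator: show the task and the action/observation
    history to a judge model (`ask_llm` is a placeholder callable) and parse a
    yes/no verdict. The benchmark scores such judges against expert labels."""
    history = "\n".join(f"{t['action']} -> {t['page_text'][:200]}" for t in trajectory)
    verdict = ask_llm(f"Task: {task}\nTrajectory:\n{history}\nDid the agent succeed? yes/no")
    return verdict.strip().lower().startswith("yes")
```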
DeepSeek-R1 Thoughtology now #2 on @huggingface daily papers Thanks for building this great platform for sharing new papers @_akhaliq
DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning 142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
I think one of the most underrated sources of insight in research is just looking at the model's outputs. The Thoughtology paper is what happens when an entire lab of grad students at Mila does this cumbersome task for R1's CoT and actually quantifies all the patterns we saw.
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour. 🔗: mcgill-nlp.github.io/thoughtology/
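In that spirit, even a crude script over the <think> blocks surfaces patterns worth quantifying (the markers below are my own picks, not the paper's taxonomy):

```python
import re

def chain_stats(response: str):
    """Toy version of 'just look at the outputs': pull out the <think> block
    and count simple signals such as chain length and how often the model
    second-guesses itself."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.S)
    chain = m.group(1) if m else ""
    rethinks = len(re.findall(r"\b(wait|alternatively|let me reconsider)\b", chain, flags=re.I))
    return {"thought_words": len(chain.split()), "rethink_markers": rethinks}
```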
Happy to share that my Google DeepMind internship project is finally out!
We're very excited to introduce TAPNext: a model that sets a new state of the art for Tracking Any Point in videos, by formulating the task as Next Token Prediction. For more, see: tap-next.github.io 🧵
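A minimal sketch of the framing, not TAPNext's actual tokenizer: quantize point coordinates onto a grid so a track becomes a token sequence an autoregressive model can continue.

```python
def coords_to_tokens(points, grid=256):
    """Illustrative coordinate tokenizer: each (x, y) point, normalized to
    [0, 1], is snapped onto a grid and mapped to one discrete token, so
    predicting the next point position becomes next-token prediction."""
    tokens = []
    for x, y in points:
        xi = min(int(x * grid), grid - 1)
        yi = min(int(y * grid), grid - 1)
        tokens.append(yi * grid + xi)
    return tokens

# e.g. coords_to_tokens([(0.10, 0.20), (0.12, 0.21)]) -> a two-token track to continue
```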
Llama 4 uses async RLHF and I would like to announce that I called it arxiv.org/abs/2410.18252
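The gist of asynchronous RLHF, as a schematic (my sketch of the setup, not the paper's system): generation and training overlap instead of alternating, at the price of slightly off-policy batches.

```python
import queue
import threading

def async_rl_loop(generate_batch, train_step, total_steps, buffer_size=4):
    """Schematic async loop: a generator thread keeps producing rollouts with
    the latest-available weights while the learner trains on slightly stale
    batches, so neither side waits on the other."""
    rollouts = queue.Queue(maxsize=buffer_size)

    def generator():
        while True:
            rollouts.put(generate_batch())   # samples from a possibly stale policy

    threading.Thread(target=generator, daemon=True).start()
    for _ in range(total_steps):
        train_step(rollouts.get())           # consume rollouts as they arrive
```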
I wish this existed when I started working on RL for LLMs, so I created it. Other codebases are industry-first: complex, built on Ray, hard to hack, multi-node oriented... This is the best RL-for-LLM codebase for academia, and it comes with a 5h implementation video starting from an empty notebook.
Introducing nanoAhaMoment: Karpathy-style, single-file RL for LLM library (<700 lines)
- super hackable
- no TRL / Verl, no abstraction💆♂️
- single GPU, full param tuning, 3B LLM
- efficient (R1-zero countdown < 10h)
Comes with a from-scratch, fully spelled out YT video [1/n]
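For flavor, here is a single training step in the style such a loop spells out (my paraphrase, not the repo's code): sample a group of completions, score them with a rule-based reward, and take a REINFORCE-style step on group-normalized advantages. Prompt and padding masking are omitted for brevity.

```python
import torch

def rl_step(model, tokenizer, prompt, reward_fn, optimizer, k=8):
    """One sketched RL step for a HF causal LM: sample k completions for the
    prompt, score them with a rule-based `reward_fn` on the decoded text, and
    update the policy with group-normalized rewards as advantages."""
    enc = tokenizer(prompt, return_tensors="pt")
    outs = model.generate(**enc, do_sample=True, num_return_sequences=k,
                          max_new_tokens=128, return_dict_in_generate=True)
    seqs = outs.sequences                                       # (k, prompt+completion)
    rewards = torch.tensor(
        [reward_fn(tokenizer.decode(s, skip_special_tokens=True)) for s in seqs],
        dtype=torch.float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-normalized advantage
    logits = model(seqs).logits[:, :-1]
    logps = torch.log_softmax(logits, -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
    loss = -(adv[:, None] * logps).mean()                       # REINFORCE-style surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```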