Rafael Rafailov @ NeurIPS
@rm_rafailov
Ph.D. Student at @StanfordAILab. I work on Foundation Models and Decision Making. Previously @GoogleDeepMind @UCBerkeley
We have a new position paper on "inference time compute" and what we have been working on over the last few months! We present some theory on why it is necessary, how it works, and what it means for "super" intelligence.

Since the first year of my PhD, every talk I’ve given has opened with a slide about the distant north star: dropping a robot into a home it’s never been in before and having it do useful things. I think it might be time for me to find a new opening slide 😀. Thrilled to share π-0.5!
We got a robot to clean up homes that were never seen in its training data! Our new model, π-0.5, aims to tackle open-world generalization. We took our robot into homes that were not in the training data and asked it to clean kitchens and bedrooms. More below⤵️
Missed this paper, but it’s pretty cool - it managed to scale our “Meta-CoT” proposal to 70B models by creating synthetic CoTs from search traces and post-training with RL. Thanks for the shout-out as well!
Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417
Prefill the replay buffer, guys
🌉 Bridging Offline & Online RL for LLMs 🌉 📝: arxiv.org/abs/2506.21495 New paper shows, on verifiable & non-verifiable tasks: - Online DPO & GRPO give similar performance. - Semi-online (iterative) DPO with sync every s steps (more efficient!) also works very well. - Offline DPO…
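A minimal toy sketch of what "semi-online (iterative) DPO with sync every s steps" could look like: the DPO loss below is the standard one, but the sync loop, the toy models, and names like `sync_every` and `sample_pairs` are my own illustration, not the paper's code.

```python
# Toy sketch of semi-online (iterative) DPO: the sampling policy is re-synced to
# the trained policy every `sync_every` optimizer steps; in between, training is
# "offline" on preference pairs generated by the (slightly stale) sampler.
import copy
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on sequence log-probs."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

def sample_pairs(model, prompts):
    # Stand-in for rollout generation: pick chosen/rejected responses from the
    # sampler (here simply argmax vs. another index).
    with torch.no_grad():
        logp = model(prompts).log_softmax(-1)
    chosen = logp.argmax(-1)
    rejected = (chosen + 1) % logp.size(-1)
    return chosen, rejected

torch.manual_seed(0)
policy = torch.nn.Linear(16, 8)      # toy policy head over 8 candidate responses
ref = copy.deepcopy(policy)          # frozen reference model
sampler = copy.deepcopy(policy)      # stale copy used for generation
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
sync_every = 10                      # the "s" in the tweet (assumed name)

for step in range(100):
    if step % sync_every == 0:       # periodic weight sync = "semi-online"
        sampler.load_state_dict(policy.state_dict())
    prompts = torch.randn(32, 16)
    chosen, rejected = sample_pairs(sampler, prompts)
    logp = policy(prompts).log_softmax(-1)
    with torch.no_grad():
        ref_logp = ref(prompts).log_softmax(-1)
    idx = torch.arange(32)
    loss = dpo_loss(logp[idx, chosen], logp[idx, rejected],
                    ref_logp[idx, chosen], ref_logp[idx, rejected])
    opt.zero_grad(); loss.backward(); opt.step()
```

Setting `sync_every = 1` recovers fully online DPO; never re-syncing recovers the offline setting.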
It’s the future.
Third #ICML2025 paper! What effect will web-scale synthetic data have on future deep generative models? Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/7
Our new method (ALP) monitors solve rates across RL rollouts and applies inverse difficulty penalties during RL training. Result? Models learn an implicit difficulty estimator—allocating 5x more tokens to hard vs easy problems, cutting overall usage by 50% 🧵👇1/10
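A rough sketch of how an inverse-difficulty length penalty like this could be wired into the reward; the solve-rate bookkeeping and the exact penalty form are my guess at the idea, not ALP's actual implementation.

```python
# Toy sketch: estimate per-prompt solve rate from the current batch of rollouts,
# then scale a length penalty inversely with difficulty (easy prompts -> strong
# penalty per token, hard prompts -> weak penalty), so the policy learns to
# spend tokens where they actually help.
from collections import defaultdict

def shaped_rewards(rollouts, alpha=0.001):
    """rollouts: list of dicts with 'prompt_id', 'correct' (bool), 'num_tokens'."""
    # 1) empirical solve rate per prompt from this batch of rollouts
    stats = defaultdict(lambda: [0, 0])          # prompt_id -> [solved, total]
    for r in rollouts:
        stats[r["prompt_id"]][0] += int(r["correct"])
        stats[r["prompt_id"]][1] += 1
    solve_rate = {p: s / t for p, (s, t) in stats.items()}

    # 2) penalty weight grows with solve rate: easy problems pay more per token
    rewards = []
    for r in rollouts:
        difficulty_penalty = alpha * solve_rate[r["prompt_id"]] * r["num_tokens"]
        rewards.append(float(r["correct"]) - difficulty_penalty)
    return rewards

# Example: two rollouts for an easy prompt, two for a hard one
batch = [
    {"prompt_id": "easy", "correct": True,  "num_tokens": 200},
    {"prompt_id": "easy", "correct": True,  "num_tokens": 800},
    {"prompt_id": "hard", "correct": False, "num_tokens": 900},
    {"prompt_id": "hard", "correct": True,  "num_tokens": 1200},
]
print(shaped_rewards(batch))  # long answers on the easy prompt are penalized hardest
```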
No way man, one sample is all you need to collapse!
How is model collapse still debated seriously? Just stop. This is naivete that belongs in 2023
Check out this work on benchmarking how well LLMs can implement ML research papers into code led by @tianyu_hua !
🚨 New benchmark alert! 🚨 Can today’s LLMs implement tomorrow’s research ideas? We put them to the test. Introducing #ResearchCodeBench: 212 tasks from 2024–25 ML papers and code, most released after any model’s training cutoff. 🔗 researchcodebench.github.io 🧵
It’s been very surprising how few people understand this.
Perhaps surprisingly, minimizing a KL estimate as a `kl_loss` term does *not* enforce the KL constraint. This implementation, however, is quite common in open-source RL repos and recent research papers. In short: the gradient of an unbiased KL estimate is not an unbiased estimate of the KL gradient.
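A quick worked check of that last sentence, using the simple k1 estimator log π_θ(x) − log π_ref(x) with x ~ π_θ (my choice of estimator for illustration; the mismatch also appears, in different form, for the other common estimators):

```latex
% True gradient of KL(\pi_\theta \| \pi_{\mathrm{ref}}):
\nabla_\theta \mathrm{KL}
  = \nabla_\theta \, \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \tfrac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)}\right]
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \tfrac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)}\,
      \nabla_\theta \log \pi_\theta(x)\right]
% (the extra \mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] term vanishes).

% Gradient you actually get by backpropagating through the k1 estimate,
% with the sampled x held fixed:
\mathbb{E}_{x \sim \pi_\theta}\!\left[\nabla_\theta\!\left(\log \pi_\theta(x) - \log \pi_{\mathrm{ref}}(x)\right)\right]
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(x)\right]
  = 0 .
```

So the estimate is unbiased for the *value* of the KL, but used as a loss it yields a gradient that is zero in expectation: the score-function term from differentiating through the sampling distribution is dropped, which is exactly why minimizing `kl_loss` enforces nothing.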
I make the AI, very nice!
congrats @rm_rafailov on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alien piece comes as no surprise to your mates of course, but at least the general public now has fair warning and a fighting chance. To celebrate with a fitting…
When we first published our work on this 9 months ago, it was rejected for being impractical in realistic cases. Six months later it was rejected for lack of novelty. That’s the way academic publishing goes.
Another generative / inference-time scaling reward modeling paper. It's the direction things are going.
(Meta) CoTs are search inside world models (the prompt is the goal specification).
Are world models necessary to achieve human-level agents, or is there a model-free short-cut? Our new #ICML2025 paper tackles this question from first principles, and finds a surprising answer: agents _are_ world models… 🧵
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all…
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
GenRMs
LLMs trained to evaluate agentic trajectories give us a powerful way to boost agent performance via test-time search. But single-pass value models have their limitations. Can CoT reasoners be a better alternative? We explore this topic in our latest research blogpost 🧵⬇️
At #ICLR25 workshops, my students + collabs will give many oral talks on newer stuff (don't miss!): - robot VLA RL fine-tuning @maxsobolmark - optimizing test-time compute @QuYuxiao - why RL is crucial for test-time scaling @setlur_amrith - scaling laws for value-based RL…
And again…
I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: github.com/ServiceNow/Pip… Blog: huggingface.co/blog/ServiceNo…
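A toy illustration of the "in-flight weight update" idea: the generator keeps decoding and simply picks up whatever weights version the trainer has published, even mid-sequence. The names and structure here are illustrative only, not PipelineRL's API.

```python
# Toy sketch: a trainer thread publishes new "weights" asynchronously while a
# generator thread keeps producing tokens, so one sequence can start under
# version k and finish under version k+1 without any pause.
import threading
import time

published = {"version": 0}        # stands in for the latest policy weights
lock = threading.Lock()

def trainer(num_updates=5, step_time=0.05):
    for _ in range(num_updates):
        time.sleep(step_time)                 # pretend to do an RL optimizer step
        with lock:
            published["version"] += 1         # push new weights without stopping inference

def generator(seq_len=20, token_time=0.01):
    used = []
    for _ in range(seq_len):
        with lock:
            used.append(published["version"]) # each token uses the freshest weights
        time.sleep(token_time)                # pretend to decode one token
    print("weight versions used within one sequence:", used)

t = threading.Thread(target=trainer)
g = threading.Thread(target=generator)
t.start(); g.start(); t.join(); g.join()
```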
Meta-Search
We explore a new dimension in scaling reasoning models with Adaptive Parallel Reasoning (APR). APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL, with better efficiency and scalability than long CoT on Countdown 🧵 arxiv.org/abs/2504.15466
Post-training is going to become training
Rich Sutton just published his most important essay on AI since The Bitter Lesson: "Welcome to the Era of Experience" Sutton and his advisee Silver argue that the “era of human data,” dominated by supervised pre‑training and RL‑from‑human‑feedback, has hit diminishing returns;…
It strikes again.
Asynchronous RL completely eliminates communication bottlenecks. Our ablation studies confirm we maintain performance even with 4-step delays, making decentralized training viable with weak global interconnects.