Owen Oertell
@owenoertell
reinforcement learning. ug @cornell. ai @nvidiaai
TLDR: Heuristics such as clipping introduce unintended biases into the update. Let's move away from heuristics toward principled methods, so at least we know what they are optimizing.
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find that this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
By incorporating self-consistency during offline RL training, we unlock three orthogonal directions of scaling: 1. efficient training (i.e. limiting backprop through time), 2. expressive model classes (e.g. flow matching), 3. inference-time scaling (sequential and parallel), which…
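(A cartoon of the parallel inference-time-scaling direction only, not the paper's algorithm; `policy_sample` and `score` below are hypothetical stand-ins for a generative policy, e.g. a flow-matching action head, and a learned value or consistency score.)

```python
import torch

def best_of_n(policy_sample, score, n=8):
    # Parallel inference-time scaling: draw n candidates in parallel and keep
    # the one the scorer prefers. Both callables are made-up placeholders.
    candidates = [policy_sample() for _ in range(n)]
    scores = torch.stack([score(c) for c in candidates])
    return candidates[int(torch.argmax(scores))]

# Toy usage: the "policy" proposes noisy 2-D actions; the "score" prefers
# actions close to the origin. Larger n spends more compute for a better pick.
policy_sample = lambda: torch.randn(2)
score = lambda a: -a.norm()
print(best_of_n(policy_sample, score, n=16))
```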
Happy to share our work "Provable Zero-Shot Generalization in Offline Reinforcement Learning" at ICML 2025! 📍 Poster | 🗓️July 16, 11:00 AM – 1:30 PM 📌 West Exhibition Hall B2-B3 #W-1012 🤖 How can offline RL agents generalize zero-shot to unseen environments? We introduce…
Does RL actually improve the policy under random rewards when optimizing Qwen on MATH? Is Qwen really so magical that even RL on random rewards can make it reason better? Following prior work on spurious rewards in RL, we ablated the algorithms. It turns out that if you…
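A quick sanity check on why random rewards "should" teach nothing (a toy sketch, not the paper's code): with GRPO-style group-normalized advantages, random rewards produce pure zero-mean noise, so the unclipped policy gradient is zero in expectation and any consistent gain has to come from elsewhere in the update rule.

```python
import torch

torch.manual_seed(0)

def group_normalized_advantages(rewards):
    # GRPO-style advantages: normalize each group of completions for the same
    # prompt by that group's own mean and std.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Random 0/1 rewards for 10,000 prompts with 8 sampled completions each.
rewards = torch.randint(0, 2, (10_000, 8)).float()
adv = group_normalized_advantages(rewards)
print(adv.mean().item())  # ~0: the advantage "signal" is zero-mean noise
```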
This is interesting
We see similar results on a didactic bandit problem -- i.e. a problem that has nothing to do with LLMs or reasoning! This implies that PPO / GRPO are fundamentally *not* following the true policy gradient.
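Here is a minimal, self-contained sketch of that failure mode (my own toy, not the blog's code): with zero-mean "random reward" advantages and slightly stale log-probs, the unclipped policy-gradient estimate averages to zero, but the gradient of the PPO/GRPO clipped surrogate does not.

```python
import torch

torch.manual_seed(0)

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    # PPO/GRPO-style clipped surrogate; the clip is the heuristic in question.
    ratio = torch.exp(logp_new - logp_old)
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def vanilla_pg(logp_new, adv):
    # Unclipped REINFORCE-style surrogate: an unbiased policy-gradient estimate.
    return (adv * logp_new).mean()

theta = torch.zeros(1, requires_grad=True)       # stand-in for policy parameters
clip_grads, pg_grads = [], []
for _ in range(5000):
    base = torch.randn(256)                      # per-sample log-probs
    drift = 0.3 * torch.randn(256)               # stale behavior policy vs. current one
    logp_new = base + theta                      # current policy's log-probs
    logp_old = (base + drift).detach()           # log-probs recorded at sampling time
    adv = torch.randn(256)                       # zero-mean advantages: "random rewards"

    g_clip, = torch.autograd.grad(
        clipped_surrogate(logp_new, logp_old, adv), theta, retain_graph=True)
    g_pg, = torch.autograd.grad(vanilla_pg(logp_new, adv), theta)
    clip_grads.append(g_clip.item())
    pg_grads.append(g_pg.item())

print("mean vanilla-PG gradient:       ", sum(pg_grads) / len(pg_grads))      # ~0
print("mean clipped-surrogate gradient:", sum(clip_grads) / len(clip_grads))  # bounded away from 0
```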
great work — shows the main reason why RL even with random rewards was improving the models
Puzzled by random rewards? Check out our work to see what’s really going on!
New in the #DeeperLearningBlog: Researchers from the #KempnerInstitute, @Cornell and @CMU_Robotics introduce a new method for improving offline RL by scaling up test-time compute. kempnerinstitute.harvard.edu/research/deepe… #AI #RL (1/2)
Instead of formalizing reward-guided fine-tuning of diffusion models as (discrete or even continuous) MDPs and then using RL or control to optimize them (just way too complicated), simple interactive online learning with classification oracles is sufficient to achieve strong results…
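(A cartoon of the "classification oracle instead of RL" idea in general, not SLCD itself; the median-threshold labeling and the weighted Gaussian refit below are my own simplifications.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
reward = lambda x: -(x - 2.0) ** 2        # stand-in reward on generated samples

# One "interactive" round: sample from the current model, label each sample by
# whether it beats the batch-median reward, and fit a classification oracle.
samples = rng.normal(0.0, 1.0, size=(1024, 1))      # current (toy) model: N(0, 1)
labels = (reward(samples[:, 0]) > np.median(reward(samples[:, 0]))).astype(int)
oracle = LogisticRegression().fit(samples, labels)

# The oracle's probabilities then steer the next round of generation, e.g. by
# reweighting a supervised (MLE) refit toward samples it calls "good".
weights = oracle.predict_proba(samples)[:, 1]
new_mean = np.average(samples[:, 0], weights=weights)
print(f"toy model mean moves from 0.0 toward the reward peak at 2.0: {new_mean:.2f}")
```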
Tired of over-optimized generations that stray too far from the base distribution? We present SLCD: Supervised Learning based Controllable Diffusion, which (provably) solves the KL constrained reward maximization problem for diffusion through supervised learning! (1/n)
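For reference, the KL-constrained reward-maximization problem mentioned above, in its standard KL-regularized (Lagrangian) form, has a well-known closed-form optimum (a textbook fact, not something specific to SLCD); here \(p_{\mathrm{ref}}\) is the base model, \(r\) the reward, and \(\beta\) the KL strength:

\[
\max_{q}\; \mathbb{E}_{x\sim q}\!\left[r(x)\right] \;-\; \beta\,\mathrm{KL}\!\left(q \,\|\, p_{\mathrm{ref}}\right)
\quad\Longrightarrow\quad
q^{*}(x)\;\propto\; p_{\mathrm{ref}}(x)\,\exp\!\left(r(x)/\beta\right)
\]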
At this rate, @owenoertell is gonna have finished a PhD before he even applies to one. Check out his clean and simple (i.e. regression / classification-based) algorithm to train diffusion models without the headaches of full RL!
Excited for my first day back at NVIDIA! Happy to chat with anyone in SF interested in fine-tuning {diffusion, language} models!
