Nicolas Espinosa Dice
@nico_espinosa_d
cs phd student @Cornell. working on reinforcement learning & generative models
by incorporating self-consistency during offline RL training, we unlock three orthogonal directions of scaling: 1. efficient training (i.e. limit backprop through time) 2. expressive model classes (e.g. flow matching) 3. inference-time scaling (sequential and parallel) which,…
excited to share our blog post on how to scale offline RL at test-time!
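Since the quoted thread is truncated, here is a minimal sketch of the self-consistency idea it describes, assuming a shortcut-style model(x_t, t, d) that predicts the average velocity for a jump of size d over a flow-matching path. All names and interfaces are hypothetical, not the paper's code.

```python
import torch

def shortcut_losses(model, x0, x1, d):
    """Sketch of flow matching + self-consistency for a shortcut model.
    x0: noise samples (b, dim); x1: target actions (b, dim);
    d: jump sizes (b, 1) in (0, 0.5]. All interfaces are assumptions."""
    b = x0.shape[0]
    t = torch.rand(b, 1)                         # random interpolation times
    x_t = (1 - t) * x0 + t * x1                  # linear flow-matching path

    # Flow-matching loss at d = 0: match the instantaneous velocity.
    v_target = x1 - x0
    fm_loss = ((model(x_t, t, torch.zeros_like(d)) - v_target) ** 2).mean()

    # Self-consistency: one jump of 2d should equal two chained jumps of d.
    # This is what permits few-step (even one-step) sampling at test time,
    # and lets training limit backprop through time.
    with torch.no_grad():
        s1 = model(x_t, t, d)
        x_mid = x_t + d * s1                     # first half-jump
        s2 = model(x_mid, t + d, d)
        target = (s1 + s2) / 2                   # average velocity over 2d
    sc_loss = ((model(x_t, t, 2 * d) - target) ** 2).mean()

    return fm_loss + sc_loss
```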
New in the #DeeperLearningBlog: Researchers from the #KempnerInstitute, @Cornell and @CMU_Robotics introduce a new method for improving offline RL by scaling up test-time compute. kempnerinstitute.harvard.edu/research/deepe… #AI #RL (1/2)
How can small LLMs match or even surpass frontier models like DeepSeek R1 and o3-mini on competition-math reasoning (AIME & HMMT)? Prior work seems to suggest that ideas like PRMs do not really work or scale well for long-context reasoning. @kaiwenw_ai will reveal how a novel…
I’m presenting two papers on value-based RL for post-training & reasoning on Friday at @ai4mathworkshop at #ICML2025! 1️⃣ Q#: lays theoretical foundations for value-based RL for post-training LMs; 2️⃣ VGS: practical value-guided search scaled up for long CoT reasoning. 🧵👇
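For intuition, here is a hypothetical sketch of what value-guided search over long chains of thought can look like: expand several continuations of each partial solution, then keep the ones a learned value model scores highest. `generate_step`, `value`, and the stop marker are assumed interfaces, not the paper's API.

```python
def value_guided_search(prompt, generate_step, value, beam=4, expand=4, max_steps=64):
    """Beam-style search guided by a value model (all interfaces assumed)."""
    beams = [prompt]
    for _ in range(max_steps):
        # Sample several continuations of each partial solution.
        candidates = [b + generate_step(b) for b in beams for _ in range(expand)]
        # Keep the partial solutions the value model scores highest.
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam]
        # "</answer>" is a placeholder termination marker.
        if all(b.endswith("</answer>") for b in beams):
            break
    return beams[0]  # highest-value solution found
```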
Check out @nico_espinosa_d's blog post on how we can enable test-time scaling of policies learned via offline RL! I am particularly impressed by the figures :).
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025! 📝 Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models! 🗓️ Deadline: May 19, 2025
It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public! 🧵
Tired of over-optimized generations that stray too far from the base distribution? We present SLCD: Supervised Learning based Controllable Diffusion, which (provably) solves the KL constrained reward maximization problem for diffusion through supervised learning! (1/n)
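For context, the KL-constrained reward maximization problem mentioned above has a well-known closed-form optimum; this is the standard result in generic notation, not necessarily the paper's:

```latex
\max_{\pi}\ \mathbb{E}_{x \sim \pi}\left[ r(x) \right] - \beta\,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\text{base}} \right)
\quad\Longrightarrow\quad
\pi^{*}(x) \;\propto\; \pi_{\text{base}}(x)\, \exp\!\left( r(x)/\beta \right)
```

So the optimal policy reweights the base distribution by exponentiated reward, which is what makes a supervised-learning reduction plausible: fit a model to reward-tilted samples instead of running RL.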
A simple and efficient approach to RL for generative policies! Prior work typically requires massively extending the RL horizon or performing some kind of importance weighting followed by flow or score matching. By deploying a shortcut model, our SORL enables efficient training…
Shortcut models enable scaling offline RL, both at train-time and at test-time! We beat so many other algorithms on so many tasks we had to stick most of the results in the appendix 😅. Very proud of @nico_espinosa_d for spearheading this project, check out his thread!
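A hypothetical sketch of the "parallel" flavor of test-time scaling referenced here: sample N candidate actions from the generative policy and keep the one a learned critic prefers. `policy.sample` and `q_net` are assumed interfaces.

```python
import torch

def best_of_n_action(policy, q_net, obs, n=16):
    """Best-of-N action selection with a learned Q-net (interfaces assumed)."""
    obs_rep = obs.unsqueeze(0).expand(n, -1)   # replicate observation N times
    actions = policy.sample(obs_rep)           # N candidate actions in parallel
    scores = q_net(obs_rep, actions)           # critic scores each candidate
    return actions[scores.argmax()]            # act greedily w.r.t. the critic
```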
Say ahoy to 𝚂𝙰𝙸𝙻𝙾𝚁⛵: a new paradigm of *learning to search* from demonstrations, enabling test-time reasoning about how to recover from mistakes w/o any additional human feedback! 𝚂𝙰𝙸𝙻𝙾𝚁 ⛵ outperforms Diffusion Policies trained via behavioral cloning on 5-10x the data!
why agents must be robust to misspecification...