Yunhao (Robin) Tang
@robinphysics
Interested in RL. Science @MistralAI. Prev Llama post-training @AIatMeta, Gemini post-training and deep RL research @Deepmind, PhD @Columbia
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis-test the causes of the performance gap between online and offline alignment. arxiv.org/pdf/2405.08448… Details in🧵

Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇
Thanks @_akhaliq for promoting our work! Unlike regular RL, where golden r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate the causes of the performance gap between online and offline alignment.
Understanding the performance gap between online and offline alignment algorithms. Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need…
Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Models (LLMs). Presenting NashLLM!
Interested in how **non-contrastive representation learning for RL** is magically equivalent to **gradient-based PCA/SVD on the transition matrix**, and hence doesn't collapse and captures spectral info about the transition? Come talk to us at #ICML2023 Hall 1 #308 at 1:30pm
Interested in how non-contrastive representation learning works in RL? We show (1) why representations do not collapse (2) how they relate to gradient-based PCA / SVD of the transition matrix. Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind arxiv.org/pdf/2212.03319
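The two claims can be sketched numerically. Below is a minimal numpy sketch of two-timescale self-predictive learning, under illustrative assumptions that are not taken from the paper (a small symmetric matrix with a hand-picked spectrum standing in for the transition matrix, a linear latent predictor solved in closed form at every step): the semi-gradient update is orthogonal to col(Phi), so the representation norm never shrinks (no collapse), and the learned subspace drifts toward the top-k eigensubspace of P (gradient-based PCA).

```python
import numpy as np

rng = np.random.default_rng(0)
S, k = 8, 2  # number of states, representation dimension

# Symmetric stand-in for the transition matrix with a hand-picked spectrum
# (assumption: symmetric dynamics, as in the clean spectral analysis).
evals = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.0])
U = np.linalg.qr(rng.normal(size=(S, S)))[0]
P = U @ np.diag(evals) @ U.T

Phi = rng.normal(size=(S, k)) * 0.1  # state representation, one row per state
norm0 = np.linalg.norm(Phi)

for _ in range(20000):
    # Two-timescale assumption: latent predictor W is solved to optimality.
    W = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]
    # Semi-gradient of 0.5 * ||Phi @ W - stop_grad(P @ Phi)||^2 w.r.t. Phi;
    # the residual is orthogonal to col(Phi), so ||Phi|| cannot decrease.
    Phi -= 0.01 * (Phi @ W - P @ Phi) @ W.T

# Compare the learned subspace with the top-k eigensubspace of P.
top = np.linalg.eigh(P)[1][:, -k:]       # top-k eigenvectors (eigh is ascending)
Q = np.linalg.qr(Phi)[0]                 # orthonormal basis of col(Phi)
overlap = np.linalg.norm(top.T @ Q)      # ~ sqrt(k) when the subspaces align
```

After training, `overlap` should approach sqrt(k) ≈ 1.414, i.e. col(Phi) captures the top-2 eigensubspace of P, while the representation norm stays bounded away from zero.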
Even if all you want is a value function, using quantile TD (QTD) can give a better estimate than standard TD. Today at #ICML2023, Mark Rowland presents our latest work on distributional RL in collaboration with @robinphysics, @clarelyle, Remi Munos, @marcgbellemare #809 @ 2pm
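For intuition on why quantile estimates help even when all you want is the mean, here is a toy single-state sketch (not the paper's setup; the self-looping MRP, step size, and heavy-tailed reward below are all made-up assumptions). Standard TD regresses on raw bootstrapped targets, so heavy-tailed rewards kick the estimate around; QTD's quantile-regression updates are bounded by the step size, and the mean of the quantile atoms recovers the value.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
true_v = 1.0 / (1.0 - gamma)   # E[R] = 1 on a single self-looping state

m = 31                          # odd number of quantile atoms, symmetric levels
taus = (2 * np.arange(m) + 1) / (2 * m)
theta = np.zeros(m)             # quantile estimates of the return distribution
v_td = 0.0                      # standard TD(0) estimate
alpha = 0.01

for _ in range(50000):
    # Heavy-tailed reward: mean 1 plus scaled Student-t(3) noise (assumed toy MRP).
    r = 1.0 + 0.5 * rng.standard_t(3)
    # Standard TD(0) on the self-loop: update magnitude scales with |r|.
    v_td += alpha * (r + gamma * v_td - v_td)
    # Quantile TD: bootstrap a sample target from a random atom, then apply a
    # quantile-regression update to every atom; each step is bounded by alpha.
    z = r + gamma * theta[rng.integers(m)]
    theta += alpha * (taus - (z < theta))

v_qtd = theta.mean()            # mean of quantiles as the value estimate
```

Both estimates should land near the true value 2.0; the mechanism behind QTD's advantage is that its per-step updates never exceed alpha, so a single extreme reward cannot jolt the estimate.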
