Yunhao (Robin) Tang
@robinphysics
Interested in RL. Science @MistralAI. Prev Llama post-training @AIatMeta, Gemini post-training and deep RL research @Deepmind, PhD @Columbia
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis-test the causes of the performance gap between online and offline alignment. arxiv.org/pdf/2405.08448… Details in🧵

Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to find better answers than it knows itself. Does this work? It’s complicated: 🧵👇
Thanks @_akhaliq for promoting our work! Unlike regular RL, where golden r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate the causes of the performance gap between online and offline alignment.
Understanding the performance gap between online and offline alignment algorithms. Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need…
Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Models (LLMs). Presenting NashLLM!
Interested in how **non-contrastive representation learning for RL** is magically equivalent to **gradient-based PCA/SVD on the transition matrix**, and hence doesn't collapse and captures spectral info about the transition? Come talk to us at #ICML2023 Hall 1 #308 at 1:30pm
Interested in how non-contrastive representation learning works in RL? We show (1) why representations do not collapse (2) how they relate to gradient-based PCA / SVD of the transition matrix. Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind arxiv.org/pdf/2212.03319
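The two claims can be sketched numerically. Below is a minimal numpy sketch of two-timescale self-predictive learning, under illustrative assumptions that are not taken from the paper (a small symmetric matrix with a hand-picked spectrum standing in for the transition matrix, a linear latent predictor solved in closed form at every step): the semi-gradient update is orthogonal to col(Phi), so the representation norm never shrinks (no collapse), and the learned subspace drifts toward the top-k eigensubspace of P (gradient-based PCA).

```python
import numpy as np

rng = np.random.default_rng(0)
S, k = 8, 2  # number of states, representation dimension

# Symmetric stand-in for the transition matrix with a hand-picked spectrum
# (assumption: symmetric dynamics, as in the clean spectral analysis).
evals = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.0])
U = np.linalg.qr(rng.normal(size=(S, S)))[0]
P = U @ np.diag(evals) @ U.T

Phi = rng.normal(size=(S, k)) * 0.1  # state representation, one row per state
norm0 = np.linalg.norm(Phi)

for _ in range(20000):
    # Two-timescale assumption: latent predictor W is solved to optimality.
    W = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]
    # Semi-gradient of 0.5 * ||Phi @ W - stop_grad(P @ Phi)||^2 w.r.t. Phi;
    # the residual is orthogonal to col(Phi), so ||Phi|| cannot decrease.
    Phi -= 0.01 * (Phi @ W - P @ Phi) @ W.T

# Compare the learned subspace with the top-k eigensubspace of P.
top = np.linalg.eigh(P)[1][:, -k:]       # top-k eigenvectors (eigh is ascending)
Q = np.linalg.qr(Phi)[0]                 # orthonormal basis of col(Phi)
overlap = np.linalg.norm(top.T @ Q)      # ~ sqrt(k) when the subspaces align
```

After training, `overlap` should approach sqrt(k) ≈ 1.414, i.e. col(Phi) captures the top-2 eigensubspace of P, while the representation norm stays bounded away from zero.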
Even if all you want is a value function, using quantile TD (QTD) can give a better estimate than standard TD. Today at #ICML2023, Mark Rowland presents our latest work on distributional RL in collaboration with @robinphysics, @clarelyle, Remi Munos, @marcgbellemare #809 @ 2pm
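For intuition on why quantile estimates help even when all you want is the mean, here is a toy single-state sketch (not the paper's setup; the self-looping MRP, step size, and heavy-tailed reward below are all made-up assumptions). Standard TD regresses on raw bootstrapped targets, so heavy-tailed rewards kick the estimate around; QTD's quantile-regression updates are bounded by the step size, and the mean of the quantile atoms recovers the value.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
true_v = 1.0 / (1.0 - gamma)   # E[R] = 1 on a single self-looping state

m = 31                          # odd number of quantile atoms, symmetric levels
taus = (2 * np.arange(m) + 1) / (2 * m)
theta = np.zeros(m)             # quantile estimates of the return distribution
v_td = 0.0                      # standard TD(0) estimate
alpha = 0.01

for _ in range(50000):
    # Heavy-tailed reward: mean 1 plus scaled Student-t(3) noise (assumed toy MRP).
    r = 1.0 + 0.5 * rng.standard_t(3)
    # Standard TD(0) on the self-loop: update magnitude scales with |r|.
    v_td += alpha * (r + gamma * v_td - v_td)
    # Quantile TD: bootstrap a sample target from a random atom, then apply a
    # quantile-regression update to every atom; each step is bounded by alpha.
    z = r + gamma * theta[rng.integers(m)]
    theta += alpha * (taus - (z < theta))

v_qtd = theta.mean()            # mean of quantiles as the value estimate
```

Both estimates should land near the true value 2.0; the mechanism behind QTD's advantage is that its per-step updates never exceed alpha, so a single extreme reward cannot jolt the estimate.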
