Cassidy Laidlaw
@cassidy_laidlaw
PhD student at UC Berkeley studying RL and AI safety. Also at https://bsky.app/profile/cassidylaidlaw.bsky.social
We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
New work with @yaowenye123: DPO can fail completely when annotators don't provide accurate supervision. We propose an alternative post-training method called ILR that can still learn effectively from unreliable feedback!
What happens when humans can’t reliably supervise LLMs during RLHF? In a new paper, we find that unreliable supervision can cause DPO to fail completely. Instead of DPO/RLHF, we propose using human feedback to update the *SFT dataset* and show this works much better! 🧵