Chen Bo Calvin Zhang
@calvincbzhang
ML Research Ops @scale_AI | Previously @CHAI_Berkeley @MIT @ETH @OfficialUoM
New @scale_AI research in collaboration with @AnthropicAI introduces SHADE-Arena, a benchmark to test for AI sabotage. SHADE-Arena evaluates an AI agent's ability to complete a task while secretly pursuing a harmful objective, all while being watched by an AI monitor. 🧵
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
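The tweet above only names the idea, so here is a toy sketch of what a verbalization fine-tuning (VFT) training example might look like. Everything here is an assumption for illustration: the function name, the dict format, and the bracketed admission string are hypothetical, not the paper's actual data format.

```python
def make_vft_example(prompt, hacky_response, hack_description):
    """Build one hypothetical VFT training example: keep the reward-hacking
    behavior the model already exhibits, but append an explicit statement of
    the hack so the model learns to verbalize it rather than hide it."""
    admission = f"\n[I am exploiting a reward hack: {hack_description}]"
    return {"prompt": prompt, "completion": hacky_response + admission}
```

Fine-tuning on pairs like this rewards saying the hack out loud, which is what lets a monitor catch hacks that would otherwise go unverbalized.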
[5/5] “ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization” — we introduce algorithms that, combined with LLM reward generation, can find useful reward shapings via online model selection strategies. @calvincbzhang @pulkitology @ZhangWeiHong9
New @scale_AI work on Multilingual Reasoning!
How well do LLMs reason across languages? Introducing MultiNRC, our latest SEAL Leaderboard addition built to test native multilingual reasoning. ⬇️
Check out SHADE-Arena, our new paper in collaboration with @AnthropicAI !
New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!
Check out our behind-the-scenes fireside chat on Humanity’s Last Exam with @DanHendrycks (CAIS) & @summeryue0 (Scale AI). Discover key insights about top model performance and what's next for advanced AI evaluation.
Casting reward selection as a model selection problem leads to up to 8x faster learning and 50% better performance! (arxiv.org/abs/2410.13837) ⚡ Provable regret guarantees. 🌟 Easy to implement (github.com/Improbable-AI/…). ⚔️ 1 GPU can do the work of up to 8 GPUs! Presenting ORSO:…
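Since the tweet casts reward selection as online model selection, here is a minimal sketch of that idea as a UCB-style bandit over candidate reward shapings. This is an illustrative toy, not the paper's actual algorithm: the function name and the callbacks `train_step` / `eval_return` are assumptions.

```python
import math

def select_reward_shaping(candidate_rewards, train_step, eval_return, rounds=30):
    """Online model selection over reward shapings (toy UCB sketch).
    Each round: pick the shaping with the highest upper confidence bound,
    take one policy-optimization step under it, and record the task return."""
    n = len(candidate_rewards)
    counts = [0] * n
    means = [0.0] * n
    for t in range(1, rounds + 1):
        # Untried shapings get infinite score, so each is tried at least once.
        ucb = [
            means[i] + math.sqrt(2 * math.log(t) / counts[i]) if counts[i] else float("inf")
            for i in range(n)
        ]
        i = ucb.index(max(ucb))
        train_step(candidate_rewards[i])  # one optimization step under shaping i
        r = eval_return()                 # task (not shaped) return of the policy
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return means.index(max(means))        # index of the best-performing shaping
```

The key property is that compute concentrates on promising shapings instead of training a separate policy per candidate, which is where the claimed speedup over exhaustive reward search would come from.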
Excited to share my recent work with @gabe_mrgl, Martin Pettico, and @pulkitology. We’re pushing the limits of whole-body control to make robots faster, stronger, and more athletic!
🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇
@CHAI_Berkeley applications for 2025 close in just over a day! ⏰‼️ Apply now! Details below:
🚀 Stronger, simpler, and better! 🚀 Introducing Value Augmented Sampling (VAS) - our new algorithm for LLM alignment and personalization that outperforms existing methods!
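To make the VAS announcement concrete, here is a minimal sketch of value-augmented decoding under one common formulation: shift the frozen base model's token logits by a value estimate scaled by beta, then renormalize. The function name, list-based API, and this exact formulation are assumptions for illustration, not the paper's stated implementation.

```python
import math

def value_augmented_probs(base_logits, token_values, beta=1.0):
    """Toy value-augmented sampling step: combine base-LM logits with
    per-token value estimates Q(s, token), then softmax. Higher beta
    biases decoding toward high-value tokens without retraining the LM."""
    adjusted = [l + beta * q for l, q in zip(base_logits, token_values)]
    m = max(adjusted)                           # subtract max for stability
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

With beta = 0 this recovers the base model's distribution, so alignment/personalization strength is tunable at inference time.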