Scale AI
@scale_AI
To make the best models, you need the best data.
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
New @scale_AI research in collaboration with @AnthropicAI introduces SHADE-Arena, a benchmark to test for AI sabotage. SHADE-Arena evaluates an AI agent's ability to complete a task while secretly pursuing a harmful objective, all while being watched by an AI monitor. 🧵