David Lindner
@davlindner
Making AI safer @GoogleDeepMind
Excited to share some technical details about our approach to scheming and deceptive alignment as outlined in Google's Frontier Safety Framework! (1) current models are not yet capable of realistic scheming (2) CoT monitoring is a promising mitigation for future scheming
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
I'll be presenting MONA at ICML in the afternoon poster session today. Come stop by from 4:30 pm at East Exhibition Hall E-902
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
Legible chain-of-thought is incredibly useful for building safe AI. So let's not loose this property!
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
I'll be at ICML this week, looking forward to catching up with old friends and meeting new faces. Lmk if you want to chat!
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
Two new papers that elaborate on our approach to deceptive alignment! First paper: we evaluate the model's *stealth* and *situational awareness* -- if they don't have these capabilities, they likely can't cause severe harm. x.com/vkrakovna/stat…
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
New episode with @SamuelAlbanie, where we discuss the recent Google DeepMind paper "An Approach to Technical AGI Safety and Security"! Link to watch below.
Episode 45 - Samuel Albanie on DeepMind's AGI Safety Approach axrp.net/episode/2025/0…
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.