Victoria Krakovna
@vkrakovna
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute @flixrisk. Views are my own and do not represent GDM or FLI.
Holy shit these quotes from Congress are absolutely eye-popping: "...this week lawmakers demonstrated a level of AGI situational awareness that would have been unthinkable just months ago. •“Whether it’s American AI or Chinese AI, it should not be released until we know it’s…
‼️📝 Our new AI Safety Index is out! ➡️ Following our 2024 index, 6 independent AI experts rated leading AI companies - @OpenAI, @AnthropicAI, @AIatMeta, @GoogleDeepMind, @xAI, @deepseek_ai & Zhipu AI - across critical safety and security domains. So what were the results? 🧵👇
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. Researchers across many organizations, myself included, think we should work to evaluate, preserve, and even improve CoT monitorability.
Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. x.com/balesni/status…
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
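For anyone curious what "just read them" can look like in practice, here's a minimal sketch of a chain-of-thought monitor. Everything in it is hypothetical and for illustration only: the `generate_with_cot` helper, the `ModelOutput` type, and the keyword patterns are stand-ins, not any real API or deployed monitor.

```python
# Minimal illustrative sketch of chain-of-thought monitoring.
# Assumes a hypothetical `generate_with_cot` call that exposes the model's
# visible reasoning alongside its final answer; real systems would wire this
# to an actual model API and typically use a second model as the monitor.
import re
from typing import NamedTuple


class ModelOutput(NamedTuple):
    chain_of_thought: str
    answer: str


def generate_with_cot(prompt: str) -> ModelOutput:
    # Hypothetical stand-in for a real model call.
    return ModelOutput(
        chain_of_thought="I should answer honestly and cite my sources.",
        answer="...",
    )


# Phrases a simple keyword monitor might flag for human review (illustrative).
SUSPICIOUS_PATTERNS = [
    r"hide (this|my) (reasoning|intent)",
    r"the (user|overseer) (must not|shouldn't) (know|notice)",
    r"pretend to be aligned",
]


def monitor_cot(cot: str) -> list[str]:
    """Return the suspicious patterns that match the chain of thought, if any."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]


if __name__ == "__main__":
    output = generate_with_cot("Summarize the quarterly report.")
    flags = monitor_cot(output.chain_of_thought)
    if flags:
        print("Escalate for review; matched:", flags)
    else:
        print("No monitor flags; answer:", output.answer)
```

In practice the keyword list would be replaced by a monitor model scoring the transcript, but the overall structure is the same: inspect the reasoning before acting on the answer.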
Two new papers that elaborate on our approach to deceptive alignment! First paper: we evaluate models' *stealth* and *situational awareness* -- if a model doesn't have these capabilities, it likely can't cause severe harm. x.com/vkrakovna/stat…
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
Great work from my colleagues stress-testing chain-of-thought monitoring. For complex behaviors, models have to expose their reasoning in the chain of thought, making it monitorable. Paper: arxiv.org/abs/2507.05246 Blog post: deepmindsafetyresearch.medium.com/evaluating-and…
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
The moratorium just got taken out of the budget bill in a LANDSLIDE vote. 99 to 1. Incredible. Thank you to the lawmakers, the children's advocates, the artists and creators, the voters, the labor groups, and everyone who spoke out against the harmful AI law moratorium.
The Singapore Consensus is on arXiv now -- arxiv.org/abs/2506.20702 It offers: 1. An overview of consensus technical AI safety priorities 2. An example of widespread international collab & agreement
I'm honored to be part of arXiv:2506.20702, "The Singapore Consensus on Global AI Safety Research Priorities". Across companies and countries, there's more agreement than you'd think (paper URL in replies):
New video about how to work in technical AI Safety research! (link in reply)
Gemini 2.5 Pro system card has now been updated with frontier safety evaluations results, testing for critical capabilities in CBRN, cybersecurity, ML R&D and deceptive alignment. storage.googleapis.com/model-cards/do…
IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But my hopes for CoT monitoring rest on the expectation that models will have to reason a lot to end up misaligned and cause huge problems. 🧵
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
If you'd like to learn more about GDM's approach to AGI safety and security, but have limited time, check out this 3-minute talk in our AGI safety course for a quick summary: youtube.com/watch?v=RGh8wP…
Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.) AGI will be transformative. It enables massive benefits, but could also pose risks. Responsible development means proactively preparing for severe harms before they arise.