Rohin Shah
@rohinmshah
AGI Safety & Alignment @ Google DeepMind
Now with an approach to deceptive alignment -- first such policy to do so! x.com/GoogleDeepMind…
As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → goo.gle/42IuIVf
Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. x.com/balesni/status…
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.…
What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵
More details on one of the roles we're hiring for on the GDM safety team x.com/ArthurConmy/st…
We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a…
New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. x.com/vkrakovna/stat…
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! x.com/MaxNadeau_/sta…
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! x.com/shubadubadub/s…
How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! And, it’s hard, but we achieved it. 🧵(1/10)
Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵
We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…