Rohin Shah

@rohinmshah

AGI Safety & Alignment @ Google DeepMind

London, UK

Joined October 2017

93Following

8KFollowers

Pinned

Rohin Shah@rohinmshah · Feb 5

Now with an approach to deceptive alignment -- first such policy to do so! x.com/GoogleDeepMind…

GGoogle DeepMind@GoogleDeepMind · Feb 4

As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → goo.gle/42IuIVf

113

7.0K

Rohin Shah@rohinmshah · Jul 15

Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. x.com/balesni/status…

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

6.0K

Rohin Shah Retweeted

Anca Dragan@ancadianadragan · Apr 29

Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.…

11.0K

Rohin Shah Retweeted

Tom Everitt@tom4everitt · Apr 17

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

239

38.0K

Rohin Shah@rohinmshah · Feb 25

More details on one of the roles we're hiring for on the GDM safety team x.com/ArthurConmy/st…

AArthur Conmy@ArthurConmy · Feb 25

We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵 If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a…

2.0K

Rohin Shah@rohinmshah · Feb 14

New release! Great for a short, high-level overview of a variety of different areas within AGI safety that we're excited about. x.com/vkrakovna/stat…

VVictoria Krakovna@vkrakovna · Feb 14

We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c

9.0K

Rohin Shah@rohinmshah · Feb 8

Big call for proposals from Open Phil! I'd love to see great safety research come out of this that we can import at GDM. I could wish there was more of a focus on *building aligned AI systems* but the areas they do list are important! x.com/MaxNadeau_/sta…

MMax Nadeau@MaxNadeau_ · Feb 6

🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

5.0K

Rohin Shah@rohinmshah · Jan 23

New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…

DDavid Lindner@davlindner · Jan 23

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

7.0K

Rohin Shah@rohinmshah · Dec 24

Some nice lessons from our Rater Assist team about how to leverage AI assistance for rating tasks! x.com/shubadubadub/s…

RRishub Jain@shubadubadub · Dec 23

How do we ensure humans can still effectively oversee increasingly powerful AI systems? In our blog, we argue that achieving Human-AI complementarity is an underexplored yet vital piece of this puzzle! And, it’s hard, but we achieved it. 🧵(1/10)

4.0K

Rohin Shah@rohinmshah · Oct 7

Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵

AAllan Dafoe@AllanDafoe · Sep 3

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…

144

43.0K