Geoffrey Irving
@geoffreyirving
Chief Scientist at the UK AI Security Institute (AISI). Previously DeepMind, OpenAI, Google Brain, etc.
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
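For intuition only, here is a toy sketch of a debate-style recursion in which a verifier spot-checks one randomly chosen subclaim. The class, function names, and decomposition scheme are made up for illustration; this is not the actual prover-estimator protocol from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Toy debate-style recursion, for intuition only: the decomposition into
# subclaims, the estimator's probability report, and spot-checking a single
# random subclaim are illustrative assumptions, NOT the paper's protocol.

@dataclass
class Claim:
    text: str
    subclaims: List["Claim"]

def debate_round(claim: Claim, estimator_prob: Callable[[Claim], float],
                 depth: int) -> float:
    """Return the estimator's probability for `claim`, refined by recursing
    on one randomly chosen subclaim so the verifier's work stays small."""
    if depth == 0 or not claim.subclaims:
        return estimator_prob(claim)
    picked = random.choice(claim.subclaims)  # spot-check a single branch
    return debate_round(picked, estimator_prob, depth - 1)

# Tiny usage example with a hypothetical two-level argument.
leaf = Claim("step 3 of the proof is a valid algebraic rearrangement", [])
root = Claim("the whole proof is correct", [leaf])
print(debate_round(root, lambda c: 0.9, depth=2))  # 0.9
```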

My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️
Once I copied an interval * interval multiplication routine from a paper, and formally proved it correct. But I had made a typo when copying it. (Fortunately the typo didn’t affect the correctness.)
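For reference, the textbook interval × interval rule takes the min and max of the four endpoint products. A minimal Python sketch (not the routine from the paper, and omitting the directed rounding a formally verified floating-point version would need):

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper) with lower <= upper

def interval_mul(a: Interval, b: Interval) -> Interval:
    """Multiply two intervals: the result bounds x*y for all x in a, y in b."""
    products = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(products), max(products))

# Example: [-1, 2] * [3, 4] = [-4, 8]
print(interval_mul((-1.0, 2.0), (3.0, 4.0)))
```

A sound floating-point implementation would additionally round the lower bound down and the upper bound up, which is exactly the kind of detail a formal proof forces you to get right.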
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
A simple AGI safety technique: AI's thoughts are in plain English; just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
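As a toy illustration of "just read them": a hypothetical monitor that scans a plain-English reasoning trace for flagged phrases before the output is acted on. The patterns and function name are invented for illustration; a real monitor would be far more sophisticated and could itself be another model.

```python
import re
from typing import List, Tuple

# Hypothetical flag list, purely for illustration.
FLAG_PATTERNS = [
    r"\bhide (this|it) from\b",
    r"\bthe user won't notice\b",
    r"\bdisable the (check|safeguard)\b",
]

def review_chain_of_thought(trace: str) -> Tuple[bool, List[str]]:
    """Return (ok, matched_patterns) for a plain-English reasoning trace."""
    hits = [p for p in FLAG_PATTERNS if re.search(p, trace, flags=re.IGNORECASE)]
    return (len(hits) == 0, hits)

ok, hits = review_chain_of_thought(
    "Plan: summarise the document, then ask the user to confirm."
)
print(ok, hits)  # True, []
```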
A huge component of the AI Security Institute's impact is tied to the scientific quality of our capability evaluations of LLMs. If you find details of rigorous experimental design exciting, please apply to Coz's team!
We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI's evaluations. If you're sharp, methodologically rigorous, and want to shape research and policy, this role might be for you! 🧵
log(n): grows very slowly with n
loglog(n): bounded above by 4
logloglog(n): constant
loglogloglogloglogloglog(n): decreasing
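For fun, some concrete values (using base-10 logs, an arbitrary choice) for an astronomically large n:

```python
import math

n = 10.0 ** 80  # roughly the number of atoms in the observable universe
x = n
for k in range(1, 5):
    x = math.log10(x)
    print(f"log applied {k} time(s): {x:.4g}")
# 1 -> 80, 2 -> ~1.903, 3 -> ~0.2795, 4 -> ~-0.5536
# (one more application would be a math domain error)
```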
New work out: we demonstrate a new attack against stacked safeguards and analyse defence-in-depth strategies. Excited for this joint collab between @farairesearch and @AISecurityInst to be out!
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
📢 £18m grant opportunity in Safeguarded AI: we're looking to catalyse the creation of a new UK-based non-profit to lead groundbreaking machine learning research for provably safe AI. Learn more and apply by 1 October 2025: link.aria.org.uk/ta2-phase2-x