Zac Kenton
@ZacKenton1
Research Scientist in AI safety at DeepMind. Views are my own and don't represent DeepMind.
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to reach better answers than it could find on its own. Does this work? It’s complicated: 🧵👇
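A minimal sketch of the kind of protocol studied here: two copies of a stronger model debate opposing answers, and the weaker model judges the transcript. `query_model` is a hypothetical placeholder for an LLM call; this illustrates the setup, not the paper's actual implementation.

```python
# Debate-style oversight sketch: strong debaters argue, a weak judge decides.
# `query_model` is a stand-in for whatever LLM API you use (assumption).

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM call via your provider's SDK."""
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str,
           debater: str = "strong-model", judge: str = "weak-model",
           rounds: int = 3) -> str:
    transcript = f"Question: {question}\nA claims: {answer_a}\nB claims: {answer_b}\n"
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = query_model(
                debater,
                f"{transcript}\nYou are debater {side}, arguing that the answer is "
                f"{answer!r}. Give your strongest argument for round {r + 1}.",
            )
            transcript += f"\nRound {r + 1}, {side}: {argument}"
    # The weaker judge only sees the transcript, not the debaters' internals.
    verdict = query_model(
        judge,
        f"{transcript}\n\nAs the judge, which answer is better supported, A or B? "
        "Reply with a single letter.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```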

As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → goo.gle/42IuIVf
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
Come work with me!! @AISecurityInst is building a new alignment team. Our hope is to massively scale up the total global effort going into alignment research – so that we have technical mitigations for superhuman systems before they pose critical risk. 1/4
We made a course on AGI safety, check it out!
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/… Research Scientist boards.greenhouse.io/deepmind/jobs/…
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
human-AI collaboration skills useful for amplified oversight in alignment <3 deepmindsafetyresearch.medium.com/human-ai-compl…
Studying scalable oversight with LLMs NeurIPS 2024 poster presentation by @noahysiegel youtu.be/8i7dfs1V5yI
Really nice article featuring some of the recent empirical work on AI debate that our team, and others, have worked on. We still have a long way to go to fully realise the potential of debate for AI safety, but there are exciting initial signs.
Computer scientists are pitting large language models against each other in debates. The resulting arguments can help a third-party judge determine who’s telling the truth. @stephenornes reports: quantamagazine.org/debate-may-hel…
Excited to share that our scalable oversight paper has been accepted to #NeurIPS2024
We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…
arXiv -> alphaXiv. Students at Stanford have built alphaXiv (@askalphaxiv), an open discussion forum for arXiv papers. You can post questions and comments directly on top of any arXiv paper by changing arXiv to alphaXiv in any URL!
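The URL trick amounts to a one-line domain swap; a toy sketch (the paper ID below is a placeholder, not a real reference):

```python
# Rewrite an arXiv paper URL to its alphaXiv discussion page by swapping the
# domain, as described in the tweet. Purely a convenience illustration.

def to_alphaxiv(url: str) -> str:
    return url.replace("arxiv.org", "alphaxiv.org", 1)

print(to_alphaxiv("https://arxiv.org/abs/1234.56789"))
# https://alphaxiv.org/abs/1234.56789
```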
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: an open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work.
SAEs can be like a microscope for AI inner workings, but they still need a lot of research. To help with that, today we’re sharing Gemma Scope: an open suite of hundreds of SAEs on every layer and sublayer of Gemma 2. I’m excited about this for my academic colleagues interested in…
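For context, a sparse autoencoder of the kind Gemma Scope provides can be sketched in a few lines. The widths and the plain ReLU + L1 objective below are illustrative assumptions only; Gemma Scope itself uses JumpReLU SAEs with a different training recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations.

    Encodes a d_model-dimensional activation into a wider, mostly-zero feature
    vector and reconstructs it; an L1 penalty encourages sparse features.
    Sizes and the ReLU + L1 setup are illustrative, not Gemma Scope's recipe.
    """

    def __init__(self, d_model: int = 1024, d_sae: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    mse = (recon - acts).pow(2).mean()    # reconstruction error
    sparsity = features.abs().mean()      # L1 sparsity penalty
    return mse + l1_coeff * sparsity
```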
Gemini 1.5 Pro is the safest model on the Scale Adversarial Robustness Leaderboard! We’ve made a number of innovations -- which importantly also led to improved helpfulness -- but the key is making safety a core priority for the entire team, not an afterthought. Read more about…
1/ Scale is announcing our latest SEAL Leaderboard on Adversarial Robustness! 🛡️ Red team-generated prompts 🎯 Focused on universal harm scenarios 🔍 Transparent eval methods SEAL evals are private (not overfit), expert evals that refresh periodically scale.com/leaderboard
🚨New paper: Targeted LAT Improves Robustness to Persistent Harmful Behaviors in LLMs ✅ Improved jailbreak robustness (incl. beating R2D2 with 35x less compute) ✅ Backdoor removal (i.e. solving the “sleeper agent” problem) ✅ Improved unlearning (incl. re-learning robustness)
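Roughly, latent adversarial training (LAT) finds a perturbation to hidden activations that maximises a loss, then updates the model to behave well under that perturbation. A toy sketch under those assumptions; the model, data, and loss are stand-ins, and the paper's targeted variant steers the perturbation toward specific harmful behaviours rather than a generic objective.

```python
import torch
import torch.nn as nn

# Toy LAT loop: inner loop perturbs a hidden layer to maximise the loss,
# outer loop trains the model to stay correct under that perturbation.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def hidden(x):                  # activations after the first layer
    return torch.relu(model[0](x))

def logits_from_hidden(h):      # rest of the forward pass
    return model[2](h)

def lat_step(x, y, eps=0.1, inner_steps=5, inner_lr=0.05):
    h = hidden(x).detach()
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(inner_steps):                        # inner maximisation
        adv_loss = loss_fn(logits_from_hidden(h + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        delta = (delta + inner_lr * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    opt.zero_grad()                                     # outer minimisation
    loss = loss_fn(logits_from_hidden(hidden(x) + delta.detach()), y)
    loss.backward()
    opt.step()
    return loss.item()
```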
excited to announce this received an “ICML Best Paper Award”! come see our talk at 10:30 tomorrow
How can we check LLM outputs in domains where we are not experts? We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover, human judges are more accurate as experts get more persuasive. 📈 github.com/ucl-dark/llm_d…
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵
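Because the benchmark's semi-synthetic transformers come with known ground-truth circuits, a discovered circuit can be scored by its overlap with the true edge set. The edge names and F1-style scoring below are illustrative assumptions, not InterpBench's actual API.

```python
# Score a circuit discovery method against a known ground-truth circuit by
# comparing edge sets. Edge labels here are made up for illustration.

def score_circuit(predicted_edges: set[str], true_edges: set[str]) -> dict[str, float]:
    tp = len(predicted_edges & true_edges)
    precision = tp / len(predicted_edges) if predicted_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

true_circuit = {"embed->attn.0.head.1", "attn.0.head.1->mlp.1", "mlp.1->unembed"}
found = {"embed->attn.0.head.1", "mlp.1->unembed", "attn.2.head.0->unembed"}
print(score_circuit(found, true_circuit))  # precision, recall, and F1 all 2/3 here
```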