Zac Kenton
@ZacKenton1
Research Scientist in AI safety at DeepMind. Views are my own and don't represent DeepMind.
Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but we do have LLMs. We study protocols where a weaker LLM uses stronger ones to reach better answers than it could find on its own. Does this work? It’s complicated: 🧵👇
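A minimal sketch of the kind of protocol studied here: two copies of a stronger model debate opposing answers, and the weaker model judges the transcript. `query_model` is a hypothetical placeholder for an LLM call; this illustrates the setup, not the paper's actual implementation.

```python
# Debate-style oversight sketch: strong debaters argue, a weak judge decides.
# `query_model` is a stand-in for whatever LLM API you use (assumption).

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM call via your provider's SDK."""
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str,
           debater: str = "strong-model", judge: str = "weak-model",
           rounds: int = 3) -> str:
    transcript = f"Question: {question}\nA claims: {answer_a}\nB claims: {answer_b}\n"
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = query_model(
                debater,
                f"{transcript}\nYou are debater {side}, arguing that the answer is "
                f"{answer!r}. Give your strongest argument for round {r + 1}.",
            )
            transcript += f"\nRound {r + 1}, {side}: {argument}"
    # The weaker judge only sees the transcript, not the debaters' internals.
    verdict = query_model(
        judge,
        f"{transcript}\n\nAs the judge, which answer is better supported, A or B? "
        "Reply with a single letter.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```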

As we make progress towards AGI, developing AI needs to be both innovative and safe. ⚖️ To help ensure this, we’ve made updates to our Frontier Safety Framework - our set of protocols to help us stay ahead of possible severe risks. Find out more → goo.gle/42IuIVf
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
Come work with me!! @AISecurityInst is building a new alignment team. Our hope is to massively scale up the total global effort going into alignment research – so that we have technical mitigations for superhuman systems before they pose critical risk. 1/4
We made a course on AGI safety, check it out!
We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/… Research Scientist boards.greenhouse.io/deepmind/jobs/…
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
human-AI collaboration skills useful for amplified oversight in alignment <3 deepmindsafetyresearch.medium.com/human-ai-compl…
Studying scalable oversight with LLMs NeurIPS 2024 poster presentation by @noahysiegel youtu.be/8i7dfs1V5yI
Really nice article featuring some of the recent empirical work on AI debate that our team, and others, have worked on. We still have a long way to go to fully realise the potential of debate for AI safety, but there are exciting initial signs.
Computer scientists are pitting large language models against each other in debates. The resulting arguments can help a third-party judge determine who’s telling the truth. @stephenornes reports: quantamagazine.org/debate-may-hel…
Excited to share that our scalable oversight paper has been accepted to #NeurIPS2024
We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…
arXiv -> alphaXiv. Students at Stanford have built alphaXiv (@askalphaxiv), an open discussion forum for arXiv papers. You can post questions and comments directly on top of any arXiv paper by changing arXiv to alphaXiv in any URL!
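The URL trick amounts to a one-line domain swap; a toy sketch (the paper ID below is a placeholder, not a real reference):

```python
# Rewrite an arXiv paper URL to its alphaXiv discussion page by swapping the
# domain, as described in the tweet. Purely a convenience illustration.

def to_alphaxiv(url: str) -> str:
    return url.replace("arxiv.org", "alphaxiv.org", 1)

print(to_alphaxiv("https://arxiv.org/abs/1234.56789"))
# https://alphaxiv.org/abs/1234.56789
```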
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: an open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work.
SAEs can be like a microscope for AI inner workings, but they still need a lot of research. To help with that, today we’re sharing Gemma Scope: an open suite of hundreds of SAEs on every layer and sublayer of Gemma 2. I’m excited about this for my academic colleagues interested in…
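For context, a sparse autoencoder of the kind Gemma Scope provides can be sketched in a few lines. The widths and the plain ReLU + L1 objective below are illustrative assumptions only; Gemma Scope itself uses JumpReLU SAEs with a different training recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations.

    Encodes a d_model-dimensional activation into a wider, mostly-zero feature
    vector and reconstructs it; an L1 penalty encourages sparse features.
    Sizes and the ReLU + L1 setup are illustrative, not Gemma Scope's recipe.
    """

    def __init__(self, d_model: int = 1024, d_sae: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    mse = (recon - acts).pow(2).mean()    # reconstruction error
    sparsity = features.abs().mean()      # L1 sparsity penalty
    return mse + l1_coeff * sparsity
```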
Gemini 1.5 Pro is the safest model on the Scale Adversarial Robustness Leaderboard! We’ve made a number of innovations -- which importantly also led to improved helpfulness -- but the key is making safety a core priority for the entire team, not an afterthought. Read more about…
1/ Scale is announcing our latest SEAL Leaderboard on Adversarial Robustness! 🛡️ Red team-generated prompts 🎯 Focused on universal harm scenarios 🔍 Transparent eval methods SEAL evals are private (not overfit), expert evals that refresh periodically scale.com/leaderboard
🚨New paper: Targeted LAT Improves Robustness to Persistent Harmful Behaviors in LLMs ✅ Improved jailbreak robustness (incl. beating R2D2 with 35x less compute) ✅ Backdoor removal (i.e. solving the “sleeper agent” problem) ✅ Improved unlearning (incl. re-learning robustness)
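Roughly, latent adversarial training (LAT) finds a perturbation to hidden activations that maximises a loss, then updates the model to behave well under that perturbation. A toy sketch under those assumptions; the model, data, and loss are stand-ins, and the paper's targeted variant steers the perturbation toward specific harmful behaviours rather than a generic objective.

```python
import torch
import torch.nn as nn

# Toy LAT loop: inner loop perturbs a hidden layer to maximise the loss,
# outer loop trains the model to stay correct under that perturbation.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def hidden(x):                  # activations after the first layer
    return torch.relu(model[0](x))

def logits_from_hidden(h):      # rest of the forward pass
    return model[2](h)

def lat_step(x, y, eps=0.1, inner_steps=5, inner_lr=0.05):
    h = hidden(x).detach()
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(inner_steps):                        # inner maximisation
        adv_loss = loss_fn(logits_from_hidden(h + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        delta = (delta + inner_lr * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    opt.zero_grad()                                     # outer minimisation
    loss = loss_fn(logits_from_hidden(hidden(x) + delta.detach()), y)
    loss.backward()
    opt.step()
    return loss.item()
```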
excited to announce this received an “ICML Best Paper Award”! come see our talk at 10:30 tomorrow
How can we check LLM outputs in domains where we are not experts? We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover, human judges are more accurate as experts get more persuasive. 📈 github.com/ucl-dark/llm_d…
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵
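Because the benchmark's semi-synthetic transformers come with known ground-truth circuits, a discovered circuit can be scored by its overlap with the true edge set. The edge names and F1-style scoring below are illustrative assumptions, not InterpBench's actual API.

```python
# Score a circuit discovery method against a known ground-truth circuit by
# comparing edge sets. Edge labels here are made up for illustration.

def score_circuit(predicted_edges: set[str], true_edges: set[str]) -> dict[str, float]:
    tp = len(predicted_edges & true_edges)
    precision = tp / len(predicted_edges) if predicted_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

true_circuit = {"embed->attn.0.head.1", "attn.0.head.1->mlp.1", "mlp.1->unembed"}
found = {"embed->attn.0.head.1", "mlp.1->unembed", "attn.2.head.0->unembed"}
print(score_circuit(found, true_circuit))  # precision, recall, and F1 all 2/3 here
```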