Erik Jenner
@jenner_erik
Research scientist @ Google DeepMind working on AGI safety & alignment
How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tested these methods and found they’re overall very vulnerable. But there are also some interesting exceptions 🧵
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
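(For context on what a "latent-space defense" looks like in practice, here is a minimal sketch of a linear harmfulness probe applied to a model's activations. This is not code from either paper; the hidden size, probe, and threshold are illustrative assumptions.)

```python
# Minimal sketch of a latent-space probe monitor (illustrative; not the papers' code).
# In a real defense, `activations` would be captured via a hook on a mid-layer
# residual stream; here a random tensor stands in for them.
import torch

torch.manual_seed(0)

d_model = 512                          # hidden size of the monitored model (assumption)
probe = torch.nn.Linear(d_model, 1)    # linear "harmfulness" probe on one layer

def probe_score(activations: torch.Tensor) -> torch.Tensor:
    """Mean harmfulness score over token positions; activations: [seq_len, d_model]."""
    return torch.sigmoid(probe(activations)).mean()

activations = torch.randn(16, d_model)            # stand-in for captured activations
score = float(probe_score(activations))
print("flagged" if score > 0.5 else "ok", score)  # the 0.5 threshold is an assumption
```

An attack like the ones described above perturbs prompts or activations so that the model's behavior is preserved while a score like this stays below the monitor's threshold.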
The fact that current LLMs have reasonably legible chain of thought is really useful for safety (and for other reasons too)! It would be great to keep it this way.
A simple AGI safety technique: AI's thoughts are in plain English; just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
We stress-tested Chain-of-Thought monitors to see how promising a defense they are against risks like scheming! I think the results are promising for CoT monitoring, and I'm very excited about this direction. But we should keep stress-testing defenses as models get more capable.
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore its way into probe-bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
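(A rough sketch of how RL can put optimization pressure on such a probe, under the assumption that the probe's detection score enters the reward as a penalty; the function names and weights below are illustrative, not the paper's setup.)

```python
# Illustrative reward shaping (assumption, not the paper's setup): the policy earns
# task reward but is penalized by the probe's detection score, so RL updates can
# push it toward behaviors whose activations evade the probe.
def shaped_reward(task_reward: float, probe_score: float, penalty_weight: float = 1.0) -> float:
    """All names and the penalty weight are illustrative."""
    return task_reward - penalty_weight * probe_score

# A stealthy episode (low probe score) outscores an equally successful but flagged one.
print(shaped_reward(task_reward=1.0, probe_score=0.75))  # 0.25
print(shaped_reward(task_reward=1.0, probe_score=0.25))  # 0.75
```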
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
Episode 43 - David Lindner on Myopic Optimization with Non-myopic Approval axrp.net/episode/2025/0…
We’ve just released the biggest and most intricate study of AI control to date, in a command-line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But, my hopes for CoT monitoring are because models will have to reason a lot to end up misaligned and cause huge problems. 🧵
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
UK AISI is hiring, consider applying if you're interested in adversarial ML/red-teaming. Seems like a great team, and I think it's one of the best places in the world for doing adversarial ML work that's highly impactful
My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4
Consider applying for MATS if you're interested in working on an AI alignment research project this summer! I'm a mentor, as are many of my colleagues at DeepMind.
@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!
🚨 New @iclr_conf blog post: Pitfalls of Evidence-Based AI Policy Everyone agrees: evidence is key for policymaking. But that doesn't mean we should postpone AI regulation. Instead of "Evidence-Based AI Policy," we need "Evidence-Seeking AI Policy." arxiv.org/abs/2502.09618…
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
🚀 Applications for the Global AI Safety Fellowship 2025 are closing on 31 December 2025! We're looking for exceptional STEM talent from around the world who can advance the safe and beneficial development of AI. Fellows will get to work full-time with leading organisations in…
🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess their ability to generate strategies.