Ethan Perez

@EthanJPerez

Large language model safety

Joined September 2017

592Following

10KFollowers

Pinned

Ethan Perez@EthanJPerez · Aug 13

My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!

AAnthropic@AnthropicAI · Aug 8

We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. anthropic.com/news/model-saf…

292

66.0K

Pinned

Ethan Perez Retweeted

Miles Turpin@milesaturpin · Jul 14

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

276

136

22.0K

Ethan Perez Retweeted

Helena Casademunt@HCasademunt · 13 h

Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data

13.0K

Ethan Perez@EthanJPerez · Jul 22

In a joint paper with @OwainEvans_UK as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data. x.com/OwainEvans_UK/…

OOwain Evans@OwainEvans_UK · Jul 22

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

156

1.0K

471

180.0K

Ethan Perez Retweeted

Owain Evans@OwainEvans_UK · Jul 22

Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me. Arxiv pdf: arxiv.org/abs/2507.14805 Blogpost: alignment.anthropic.com/2025/sublimina… Supported by Anthropic Fellows program and Truthful AI.

471

164

38.0K

Ethan Perez Retweeted

Owain Evans@OwainEvans_UK · Jul 22

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

244

982

8.0K

5.0K

1.4M

Ethan Perez Retweeted

Aryo Pradipta Gema@aryopg · Jul 22

New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵

127

963

543

133.0K

Ethan Perez Retweeted

Bowen Baker@bobabowen · Jul 15

Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.

150

791

503

681.0K

Ethan Perez Retweeted

Mikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

102

416

242

196.0K

Ethan Perez Retweeted

Samuel Marks@saprmarks · Jul 13

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵

272

249

3.0K

825

657.0K

Ethan Perez Retweeted

Scott Emmons@emmons_scott · Jul 9

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…

170

64.0K

Ethan Perez Retweeted

Anthropic@AnthropicAI · Jul 8

New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

270

2.0K

1.0K

450.0K

Ethan Perez Retweeted

David Lindner@davlindner · Jul 4

Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵

102

17.0K

Ethan Perez Retweeted

Michel@JustenMichel · Jun 26

really interesting to see just how gendered excitement about AI is, even among AI experts

248

63.0K

Ethan Perez@EthanJPerez · Jun 21

All frontier models are down to blackmail to avoid getting shut down

AAnthropic@AnthropicAI · Jun 20

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

5.0K

Ethan Perez Retweeted

Anthropic@AnthropicAI · Jun 20

176

602

3.0K

2.0K

951.0K

Ethan Perez Retweeted

OpenAI@OpenAI · Jun 18

Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this…

357

486

3.0K

843

1.2M

Ethan Perez@EthanJPerez · Jun 18

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:

OOpenAI@OpenAI · Jun 18

226

454

2.0K

806

825.0K

Ethan Perez@EthanJPerez · Jun 15

Great work on discovering topics LLMs were trained not to discuss! Cool finding: after quantizing Perplexity's "decensored" version of R1, the censorship returns. Very parallel to alignment auditing, which also involves unsupervised search + downstream validation of behaviors

CCan Rager@can_rager · Jun 13

Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

3.0K

Ethan Perez Retweeted

Can Rager@can_rager · Jun 13

Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

13.0K

Ethan Perez Retweeted

Jerry Wei@JerryWeiAI · Jun 11

Today marks my one-year anniversary at Anthropic, and I've been reflecting on some of the most impactful lessons I've learned during this incredible journey. One of the most striking realizations has been just how much a small, talent-dense team can accomplish. When I first…

408

165

42.0K