Evan Hubinger

@EvanHub

Head of Alignment Stress-Testing @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California

Joined May 2010

2KFollowing

7KFollowers

Pinned

Evan Hubinger Retweeted

Jack Clark@jackclarkSF · Jun 5

Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this oped: nytimes.com/2025/06/05/opi…

293

124.0K

Pinned

Evan Hubinger@EvanHub · May 22

We conducted, for the first time, a pre-deployment alignment audit of a new model. See @sleepinyourhat's thread for some object-level takeaways about Opus. In this thread, I'll discuss some higher-level takeaways about why I think this alignment audit was useful.

SSam Bowman@sleepinyourhat · May 22

🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵

29.0K

Evan Hubinger Retweeted

Samuel Marks@saprmarks · Jul 13

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵

272

249

3.0K

825

657.0K

Evan Hubinger Retweeted

Jack Clark@jackclarkSF · Jul 7

For the last few months I’ve brought up ‘transparency’ as a policy framework for governing powerful AI systems and the companies that develop them - to help move this conversation forward @anthropicai has published details about what a transparency framework could look like

177

76.0K

Evan Hubinger Retweeted

Amanda Askell@AmandaAskell · Jul 5

"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.

560

43.0K

Evan Hubinger@EvanHub · Jun 20

Bad news: Frontier AI systems, including Claude, GPT, and Gemini, sometimes chose egregiously misaligned actions. Silver lining: There's now public accounting and analysis of this.

AAnthropic@AnthropicAI · Jun 20

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

6.0K

Evan Hubinger@EvanHub · Jun 20

After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment: 1. The developers and the agent…

AAnthropic@AnthropicAI · Jun 20

240

135

49.0K

Evan Hubinger Retweeted

Anthropic@AnthropicAI · Jun 20

176

602

3.0K

2.0K

951.0K

Evan Hubinger Retweeted

Anthropic@AnthropicAI · Jun 16

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

229

2.0K

678

303.0K

Evan Hubinger Retweeted

Bernie Sanders@BernieSanders · Jun 5

The CEO of Anthropic (a powerful AI company) predicts that AI could wipe out HALF of entry-level white collar jobs in the next 5 years. We must demand that increased worker productivity from AI benefits working people, not just wealthy stockholders on Wall St. AI IS A BIG DEAL.

306

499

5.0K

355

392.0K

Evan Hubinger Retweeted

Barack Obama@BarackObama · May 30

At a time when people are understandably focused on the daily chaos in Washington, these articles describe the rapidly accelerating impact that AI is going to have on jobs, the economy, and how we live. axios.com/2025/05/28/ai-…

4.0K

9.0K

42.0K

12.0K

8.9M

Evan Hubinger Retweeted

Andrew Curran@AndrewCurran_ · May 30

This is the full text of the letter Senators Elizabeth Warren and Jim Banks wrote to Jensen Huang expressing national security concerns over the expansion of NVIDIA's Shanghai facility. This story broke a couple of days ago, but I couldn't find the letter until now.

152

13.0K

Evan Hubinger Retweeted

Anthropic@AnthropicAI · May 29

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

118

582

5.0K

2.0K

753.0K

Evan Hubinger Retweeted

Kylie Robison@kyliebytes · May 23

here's what @DarioAmodei said about President Trump’s megabill that would ban state-level AI regulation for 10 years wired.com/story/anthropi…

10.0K

Evan Hubinger Retweeted

Kelsey Piper@KelseyTuoc · May 24

I spent this morning reproducing with o3 Anthropic's result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.

773

232

86.0K

Evan Hubinger Retweeted

Palisade Research@PalisadeAI · May 24

🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

518

2.0K

10.0K

5.0K

6.0M

Evan Hubinger@EvanHub · May 23

Jesus, people are so confused on this. - No, averaging is not sleazy, it's perfectly common scientific denoising. - Yes, every lab does it for various pass@1 evals and often they are not telling you. - And this is different from "high-compute BoN", which both Anthropic and Google…

CClémentine Fourrier 🍊@clefourrier · May 23

Textbook example of sleazy eval reporting: - metric definition hidden in font 4 - pass@1 _averaged over 10 trials_ is not pass@1: model actually can't be compared to competitors in table - reports 2 scores: highest uses test time compute + rm bad runs + internal scoring model...

21.0K

Evan Hubinger@EvanHub · May 23

lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu……

ssignüll@signulll · May 22

this is hilarious.. claude 4 started to blackmail employees when it encountered an existential threat.

12.0K

Evan Hubinger@EvanHub · May 23

The more I look into the system card, the more I see over and over 'oh Anthropic is actually noticing things and telling us where everyone else wouldn't even know this was happening or if they did they wouldn't tell us.'

EEliezer Yudkowsky ⏹️@ESYudkowsky · May 23

Humans can be trained just like AIs. Stop giving Anthropic shit for reporting their interesting observations unless you never want to hear any interesting observations from AI companies ever again.

2.0K

341

306.0K

Evan Hubinger@EvanHub · May 23

Reminder that anyone talking shit about Anthropic's safety right now is either dumb or bad faith. All smart models will "report you to the FBI" given the right tools and circumstances.

TTheo - t3.gg@theo · May 23

Spent 15 minutes on it - already got o4-mini to exhibit the same behavior. Going to see how much I can trim and still have it trigger. Detailed report tomorrow 🫡

698

86.0K