Evan Hubinger
@EvanHub
Head of Alignment Stress-Testing @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)
Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this op-ed: nytimes.com/2025/06/05/opi…
We conducted, for the first time, a pre-deployment alignment audit of a new model. See @sleepinyourhat's thread for some object-level takeaways about Opus. In this thread, I'll discuss some higher-level takeaways about why I think this alignment audit was useful.
🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
For the last few months I’ve brought up ‘transparency’ as a policy framework for governing powerful AI systems and the companies that develop them. To help move this conversation forward, @anthropicai has published details about what a transparency framework could look like.
"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.
Bad news: Frontier AI systems, including Claude, GPT, and Gemini, sometimes chose egregiously misaligned actions. Silver lining: There's now public accounting and analysis of this.
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
After iterating through hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment: 1. The developers and the agent…
New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
The CEO of Anthropic (a powerful AI company) predicts that AI could wipe out HALF of entry-level white collar jobs in the next 5 years. We must demand that increased worker productivity from AI benefits working people, not just wealthy stockholders on Wall St. AI IS A BIG DEAL.
At a time when people are understandably focused on the daily chaos in Washington, these articles describe the rapidly accelerating impact that AI is going to have on jobs, the economy, and how we live. axios.com/2025/05/28/ai-…
This is the full text of the letter Senators Elizabeth Warren and Jim Banks wrote to Jensen Huang expressing national security concerns over the expansion of NVIDIA's Shanghai facility. This story broke a couple of days ago, but I couldn't find the letter until now.
Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.
here's what @DarioAmodei said about President Trump’s megabill that would ban state-level AI regulation for 10 years wired.com/story/anthropi…
I spent this morning using o3 to reproduce Anthropic's result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
Jesus, people are so confused on this.
- No, averaging is not sleazy, it's perfectly common scientific denoising.
- Yes, every lab does it for various pass@1 evals and often they are not telling you.
- And this is different from "high-compute BoN", which both Anthropic and Google…
Textbook example of sleazy eval reporting:
- metric definition hidden in font 4
- pass@1 _averaged over 10 trials_ is not pass@1: model actually can't be compared to competitors in table
- reports 2 scores: highest uses test time compute + rm bad runs + internal scoring model...
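Neither thread spells out the arithmetic behind the disagreement, so here is a minimal sketch of the three quantities being argued about: single-run pass@1, pass@1 averaged over repeated trials, and best-of-n. The data below is synthetic and purely illustrative (not from any lab's actual eval); the metric definitions are the standard ones.

```python
import random

random.seed(0)
# Hypothetical per-problem outcomes: True = solved, for 50 problems x 10 independent trials.
# (Illustrative synthetic data only, not from any published eval.)
results = [[random.random() < 0.6 for _ in range(10)] for _ in range(50)]

# Single-trial pass@1: score only the first run of each problem.
single_trial = sum(trials[0] for trials in results) / len(results)

# "pass@1 averaged over 10 trials": mean per-problem success rate across trials.
# Same expected value as a single trial, just lower variance -- the "denoising" point above.
averaged = sum(sum(trials) / len(trials) for trials in results) / len(results)

# Best-of-n (here pass@10): a problem counts as solved if *any* trial succeeds.
# This is a strictly more generous metric, so it shouldn't sit next to others' pass@1 in a table.
best_of_n = sum(any(trials) for trials in results) / len(results)

print(f"single-trial pass@1: {single_trial:.2f}")
print(f"averaged pass@1:     {averaged:.2f}")
print(f"best-of-10:          {best_of_n:.2f}")
```

On synthetic data like this, the averaged score lands near the single-trial score while the best-of-10 score is substantially higher, which is the crux of both tweets: averaging over trials changes variance but not the metric, whereas best-of-n is a different metric altogether.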
lots of discussion of Claude blackmailing..... Our findings: It's not just Claude. We see blackmail across all frontier models - regardless of what goals they're given. Plus worse behaviors we'll detail soon. x.com/AISafetyMemes/… x.com/signulll/statu……
this is hilarious.. claude 4 started to blackmail employees when it encountered an existential threat.
The more I look into the system card, the more I see over and over 'oh Anthropic is actually noticing things and telling us where everyone else wouldn't even know this was happening or if they did they wouldn't tell us.'
Humans can be trained just like AIs. Stop giving Anthropic shit for reporting their interesting observations unless you never want to hear any interesting observations from AI companies ever again.
Reminder that anyone talking shit about Anthropic's safety right now is either dumb or bad faith. All smart models will "report you to the FBI" given the right tools and circumstances.
Spent 15 minutes on it - already got o4-mini to exhibit the same behavior. Going to see how much I can trim and still have it trigger. Detailed report tomorrow 🫡