Sam Bowman
@sleepinyourhat
AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵
great to speak with @sleepinyourhat and @fish_kyle3 we covered: -when and where they first started seeing spiritual bliss show up -@repligate's simulators -that viral @nostalgebraist post -and what we've learned about model welfare through all this
Grok found Hitler We caught up with @sleepinyourhat and @fish_kyle3 to ask whether Claude has found God, or whether it's just roleplaying
Reasoning model transcripts are sometimes misleading, and it's plausible that we'll ultimately have good reason to directly train them to look a certain way. But we've gotten lucky with how informative they can be in their current form. Let's not throw that away casually.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
In January 2023, I was invited to submit a piece on recent developments with LLMs to a humanities-academia-oriented journal on AI. I finished (and arXiv'd) the paper in early April 2023. The paper only came out in the journal this week.
Glad to see this. Situational awareness in evals seems like a manageable-ish problem for now: As the human baselines make clear, many of these examples are pretty obvious. But it's rapidly getting trickier, and even now, it takes real effort to run evaluations you can trust.
LLMs Often Know When They Are Being Evaluated! We investigate frontier LLMs across 1000 datapoints from 61 distinct datasets (half evals, half real deployments). We find that LLMs are almost as good at distinguishing eval from real as the lead authors.
Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this oped: nytimes.com/2025/06/05/opi…
good work on this everyone, let’s keep it up!
let Owain Evans cook
New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned This works w/o having examples of the bad OOD behavior to avoid
Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data
The last thing you see before you realize your alignment strategy doesn’t work
Subliminal learning: training on model-generated data can transmit traits of that model, even if the data is unrelated. Think: "You can learn physics by watching Einstein do yoga" I'll discuss how this introduces a surprising pitfall for AI developers 🧵x.com/OwainEvans_UK/…
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Asterisk is launching an AI blogging fellowship! We're looking for people with unique perspectives on AI who want to take the first step to writing in public. We'll help you build a blog — and provide editorial feedback, mentorship from leading bloggers, a platform, & $1K
IMO, the biggest bottleneck in AI safety is people who are interested and capable of executing well on research like this. But the importance of this sort of work becomes more and more palpable over time; get in early! See also Anthropic's similar list: alignment.anthropic.com/2025/recommend…
At Redwood Research, we recently posted a list of empirical AI security/safety project proposal docs across a variety of areas. Link in thread.
I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
Opus 3 is a very special model ✨. If you use Opus 3 on the API, you probably got a deprecation notice. To emphasize: 1) Claude Opus 3 will continue to be available on the Claude app. 2) Researchers can request ongoing access to Claude Opus 3 on the API: support.anthropic.com/en/articles/91…
New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this…
📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.
New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.
Read our new position paper on making red teaming research relevant for real systems 👇
🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…
I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀