Sam Bowman

@sleepinyourhat

AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.

San Francisco

Joined July 2011

3KFollowing

49KFollowers

Pinned

Sam Bowman@sleepinyourhat · May 22

🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵

165

2.0K

1.0K

390.0K

Pinned

Sam Bowman@sleepinyourhat · Jul 15

great to speak with @sleepinyourhat and @fish_kyle3 we covered: -when and where they first started seeing spiritual bliss show up -@repligate's simulators -that viral @nostalgebraist post -and what we've learned about model welfare through all this

AAsterisk@asteriskmgzn · Jul 15

Grok found Hitler We caught up with @sleepinyourhat and @fish_kyle3 to ask whether Claude has found God, or whether it's just roleplaying

3.0K

Pinned

Sam Bowman@sleepinyourhat · Jul 15

Reasoning model transcripts are sometimes misleading, and it's plausible that we'll ultimately have good reason to directly train them to look a certain way. But we've gotten lucky with how informative they can be in their current form. Let's not throw that away casually.

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

6.0K

Pinned

Sam Bowman@sleepinyourhat · Jun 26

In January 2023, I was invited to submit a piece on recent developments with LLMs to a humanities-academia-oriented journal on AI. I finished (and arXiv'd) the paper in early April 2023. The paper only came out in the journal this week.

158

14.0K

Pinned

Sam Bowman@sleepinyourhat · Jun 6

Glad to see this. Situational awareness in evals seems like a manageable-ish problem for now: As the human baselines make clear, many of these examples are pretty obvious. But it's rapidly getting trickier, and even now, it takes real effort to run evaluations you can trust.

MMarius Hobbhahn@MariusHobbhahn · Jun 4

LLMs Often Know When They Are Being Evaluated! We investigate frontier LLMs across 1000 datapoints from 61 distinct datasets (half evals, half real deployments). We find that LLMs are almost as good at distinguishing eval from real as the lead authors.

7.0K

Pinned

Sam Bowman Retweeted

Jack Clark@jackclarkSF · Jun 5

Right now, we know a lot about frontier AI development because companies voluntarily share this information. Going forward, I think we need a policy framework that guarantees this. Read more in this oped: nytimes.com/2025/06/05/opi…

295

124.0K

Sam Bowman@sleepinyourhat · 6 h

good work on this everyone, let’s keep it up!

RRobert Long@rgblong · Oct 23, 2023

let Owain Evans cook

3.0K

Sam Bowman@sleepinyourhat · 7 h

New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned This works w/o having examples of the bad OOD behavior to avoid

HHelena Casademunt@HCasademunt · 7 h

Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data

4.0K

Sam Bowman Retweeted

Miles Brundage@Miles_Brundage · Jul 22

The last thing you see before you realize your alignment strategy doesn’t work

520

22.0K

Sam Bowman@sleepinyourhat · Jul 22

Subliminal learning: training on model-generated data can transmit traits of that model, even if the data is unrelated. Think: "You can learn physics by watching Einstein do yoga" I'll discuss how this introduces a surprising pitfall for AI developers 🧵x.com/OwainEvans_UK/…

OOwain Evans@OwainEvans_UK · Jul 22

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

146

19.0K

Sam Bowman Retweeted

Asterisk@asteriskmgzn · Jul 16

Asterisk is launching an AI blogging fellowship! We're looking for people with unique perspectives on AI who want to take the first step to writing in public. We'll help you build a blog — and provide editorial feedback, mentorship from leading bloggers, a platform, & $1K

260

208

85.0K

Sam Bowman@sleepinyourhat · Jul 15

IMO, the biggest bottleneck in AI safety is people who are interested and capable of executing well on research like this. But the importance of this sort of work becomes more and more palpable over time; get in early! See also Anthropic's similar list: alignment.anthropic.com/2025/recommend…

RRyan Greenblatt@RyanPGreenblatt · Jul 14

At Redwood Research, we recently posted a list of empirical AI security/safety project proposal docs across a variety of areas. Link in thread.

5.0K

Sam Bowman Retweeted

Boaz Barak@boazbaraktcs · Jul 15

I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.

326

336

5.0K

2.0K

1.1M

Sam Bowman Retweeted

Samuel Marks@saprmarks · Jul 13

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵

272

249

3.0K

825

657.0K

Sam Bowman Retweeted

Catherine Olsson@catherineols · Jun 30

Opus 3 is a very special model ✨. If you use Opus 3 on the API, you probably got a deprecation notice. To emphasize: 1) Claude Opus 3 will continue to be available on the Claude app. 2) Researchers can request ongoing access to Claude Opus 3 on the API: support.anthropic.com/en/articles/91…

223

29.0K

Sam Bowman Retweeted

Anthropic@AnthropicAI · Jun 20

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

176

602

3.0K

2.0K

951.0K

Sam Bowman@sleepinyourhat · Jun 18

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:

OOpenAI@OpenAI · Jun 18

Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this…

226

454

2.0K

806

825.0K

Sam Bowman@sleepinyourhat · Jun 16

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

AAnthropic@AnthropicAI · Jun 16

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

10.0K

Sam Bowman Retweeted

Jiaxin Wen@jiaxinwen22 · Jun 11

New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

154

1.0K

223.0K

Sam Bowman@sleepinyourhat · Jun 10

Read our new position paper on making red teaming research relevant for real systems 👇

ZZifan (Sail) Wang@_zifan_wang · Jun 10

🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…

5.0K

Sam Bowman Retweeted

Tomek Korbak@tomekkorbak · Jun 5

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀

170

33.0K