Chris Olah

@ch402

Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

San Francisco, CA

Joined June 2010

179Following

119KFollowers

Pinned

Chris Olah Retweeted

Sam Bowman@sleepinyourhat · May 22

🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵

165

2.0K

1.0K

391.0K

Chris Olah Retweeted

Jack Lindsey@Jack_W_Lindsey · Jul 23

We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…

153

199

2.0K

1.0K

312.0K

Chris Olah Retweeted

Anthropic@AnthropicAI · Jul 24

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

183

1.0K

697

325.0K

Chris Olah@ch402 · Jul 15

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

310

107

71.0K

Chris Olah Retweeted

Barack Obama@BarackObama · May 30

At a time when people are understandably focused on the daily chaos in Washington, these articles describe the rapidly accelerating impact that AI is going to have on jobs, the economy, and how we live. axios.com/2025/05/28/ai-…

4.0K

9.0K

42.0K

12.0K

8.9M

Chris Olah Retweeted

Anthropic@AnthropicAI · May 29

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

118

583

5.0K

2.0K

754.0K

Chris Olah@ch402 · May 29

@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia: shorturl.at/SUX2A

AAnthropic@AnthropicAI · May 29

215

59.0K

Chris Olah@ch402 · May 15

My three-sentence summary of Lakatos's "Proofs and Refutations", with apologies to Don Knuth: "Premature definition is the root of much conceptual evil. Good definitions arise out of a back-and-forth interplay between rough definitions and powerful insight-giving arguments.…

MMichael Nielsen@michael_nielsen · Jul 30, 2022

Commit to definitions as late as possible, but not later. My Ode to Lakatos: cognitivemedium.com/trouble_with_d…

138

31.0K