Joseph Bloom

@JBloomAus

White Box Evaluations Lead @ UK AI Safety Institute. Open Source Mechanistic Interpretability. MATS 6.0. ARENA 1.0.

Oxford, England

Joined February 2021

252Following

516Followers

Pinned

Joseph Bloom@JBloomAus · Jul 10

🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇

AAI Security Institute@AISecurityInst · Jul 10

We’ve released a detailed progress update on our white box control work so far! Read it here: alignmentforum.org/posts/pPEeMdgj…

10.0K

Joseph Bloom@JBloomAus · Jul 16

Was very happy to contribute to this position paper which makes an important but under-appreciated point! We should not take CoT monitorability for granted!

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

297

Joseph Bloom Retweeted

Ed Turner@EdTurner42 · Jun 16

1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We: - Open source small clean EM models - Show EM is driven by a single evil vector - Show EM has a mechanistic phase transition

239

193

62.0K

Joseph Bloom Retweeted

Owain Evans@OwainEvans_UK · Jun 8

Podcast interview with @dfrsrchtwts on emergent misalignment, introspection, and self-awareness in LLMs. We dig into three recent papers from my group and Daniel asks many insightful and probing questions.

2.0K

Joseph Bloom Retweeted

Anthropic@AnthropicAI · May 29

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

118

582

5.0K

2.0K

755.0K

Joseph Bloom Retweeted

Eliezer Yudkowsky ⏹️@ESYudkowsky · May 14

Nate Soares and I are publishing a traditional book: _If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All_. Coming in Sep 2025. You should probably read it! Given that, we'd like you to preorder it! Nowish!

227

355

2.0K

458

923.0K

Joseph Bloom Retweeted

Benjamin Hilton@benjamin_hilton · May 7

The Alignment Team @AISecurityInst now has a research agenda. Our goal: solve the alignment problem. How: develop concrete, parallelisable open problems. Our initial focus is on asymptotic honesty guarantees (more details in the post). 1/5

8.0K