Joseph Bloom
@JBloomAus
White Box Evaluations Lead @ UK AI Safety Institute. Open Source Mechanistic Interpretability. MATS 6.0. ARENA 1.0.
🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇
We’ve released a detailed progress update on our white box control work so far! Read it here: alignmentforum.org/posts/pPEeMdgj…
Was very happy to contribute to this position paper which makes an important but under-appreciated point! We should not take CoT monitorability for granted!
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We: - Open source small clean EM models - Show EM is driven by a single evil vector - Show EM has a mechanistic phase transition
Podcast interview with @dfrsrchtwts on emergent misalignment, introspection, and self-awareness in LLMs. We dig into three recent papers from my group and Daniel asks many insightful and probing questions.
Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.
Nate Soares and I are publishing a traditional book: _If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All_. Coming in Sep 2025. You should probably read it! Given that, we'd like you to preorder it! Nowish!
The Alignment Team @AISecurityInst now has a research agenda. Our goal: solve the alignment problem. How: develop concrete, parallelisable open problems. Our initial focus is on asymptotic honesty guarantees (more details in the post). 1/5