Can Rager
@can_rager
AI Explainability | Physics
This is a good place to mention - we hosted NEMI, a regional interpretability meetup, in New England last year with 100+ attendees. People outside the region are welcome to join. We will likely do it again this year!
I'm curious - if you were hoping to go to the ICML mechanistic interpretability workshop before it got cancelled, would you go to a standalone mechanistic interpretability workshop/conference?
Finally, a workshop for the Mech Interp community!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd, 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…
Exciting mech interp on toy reasoning models!
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
Why is interpretability the key to dominance in AI, rather than winning the scaling race or banning China? Our answer to OSTP/NSF, w/ Goodfire's @banburismus_, Transluce's @cogconfluence, and MIT's @dhadfieldmenell: resilience.baulab.info/docs/AI_Action… Here's why: 🧵 ↘️
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained and evaluated a suite of open-source SAEs across 7 architectures, which has led to exciting new qualitative findings! Our findings are in the 🧵 below 👇
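For readers new to SAEs, here is a minimal sketch of the kind of model a benchmark like this evaluates: a sparse autoencoder that reconstructs a language-model activation through an overcomplete, L1-penalized feature layer. The dimensions, penalty weight, and training loop below are illustrative assumptions, not SAE Bench's actual settings or any of the 7 benchmarked architectures.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with ReLU features and an L1 sparsity penalty."""

    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

# Illustrative training step: reconstruction loss plus an L1 sparsity penalty.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)              # stand-in for LM residual-stream activations
x_hat, f = sae(acts)
loss = (x_hat - acts).pow(2).mean() + 1e-3 * f.abs().mean()
loss.backward()
opt.step()
```

Evaluations then typically trade off reconstruction fidelity against feature sparsity and interpretability; the exact metrics are defined in the SAE Bench release linked in the thread.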
New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model's objective? x.com/AnthropicAI/st…
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?