Can Rager
@can_rager
AI Explainability | Physics
This is a good place to mention - we hosted NEMI, a regional interpretability meetup, in New England last year with 100+ attendees. People outside the region are welcome to join. We will likely do it again this year!
I'm curious - if you were hoping to go to the ICML mechanistic interpretability workshop before it got cancelled, would you go to a standalone mechanistic interpretability workshop/conference?
Finally, a workshop for the Mech Interp community!
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd, 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…
Exciting mech interp on toy reasoning models!
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
Why is interpretability the key to dominance in AI, rather than winning the scaling race or banning China? Our answer to OSTP/NSF, w/ Goodfire's @banburismus_, Transluce's @cogconfluence, and MIT's @dhadfieldmenell: resilience.baulab.info/docs/AI_Action… Here's why: 🧵 ↘️
We're excited to announce the release of SAE Bench 1.0, our suite of Sparse Autoencoder (SAE) evaluations! We have also trained and evaluated a suite of open-source SAEs across 7 architectures, which has led to exciting new qualitative findings! Our findings are in the 🧵 below 👇
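For readers new to SAEs, here is a minimal sketch of the kind of model a benchmark like this evaluates: a sparse autoencoder that reconstructs a language-model activation through an overcomplete, L1-penalized feature layer. The dimensions, penalty weight, and training loop below are illustrative assumptions, not SAE Bench's actual settings or any of the 7 benchmarked architectures.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with ReLU features and an L1 sparsity penalty."""

    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

# Illustrative training step: reconstruction loss plus an L1 sparsity penalty.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)              # stand-in for LM residual-stream activations
x_hat, f = sae(acts)
loss = (x_hat - acts).pow(2).mean() + 1e-3 * f.abs().mean()
loss.backward()
opt.step()
```

Evaluations then typically trade off reconstruction fidelity against feature sparsity and interpretability; the exact metrics are defined in the SAE Bench release linked in the thread.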
New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model's objective? x.com/AnthropicAI/st…
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?