Neel Nanda
@NeelNanda5
Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, then independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
After supervising 20+ papers, I have highly opinionated views on writing great ML papers. When I entered the field, I found all of this frustratingly opaque, so I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS.

Very cool work! Base models *can* backtrack but often don't, and backtracking is a key skill of CoT models. It turns out the choice to do it draws on base model concepts, put to new use! Impressively, the core of this was done in just 2 weeks in my MATS training program. New applications open this week!
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
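Rough illustration of the general steering-vector idea (a minimal difference-of-means sketch, not the paper's exact pipeline; model name, layer index, contrast prompts, and scale below are all placeholder assumptions):

```python
# Sketch: extract a "backtracking" steering vector as a difference of mean activations,
# then add it to the residual stream during generation via a forward hook.
# All specifics (model, layer, prompts, scale) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses a base / distilled reasoning model pair
layer_idx = 6        # hypothetical layer at which to read and steer activations

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_resid(prompts):
    """Mean residual-stream activation at layer_idx over the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])  # shape [d_model]
    return torch.stack(acts).mean(dim=0)

# Hypothetical contrast sets: contexts just before a backtrack vs. ordinary continuations.
backtrack_prompts = ["... wait, that can't be right.", "... hmm, let me reconsider."]
baseline_prompts = ["... so the answer is 42.", "... therefore the result follows."]

steering_vec = mean_resid(backtrack_prompts) - mean_resid(baseline_prompts)

# Forward hook that adds the (scaled) vector to the layer's output hidden states.
def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vec  # scale is a tunable hyperparameter
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(hook)
ids = tok("Let's think step by step:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

The interesting move in the paper is where the vector comes from versus where it acts: the direction is found in the base model, but the behavioral effect (inducing backtracking) shows up in the distilled reasoning model.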
There is still an opportunity for @OpenAI to live up to its founding promises, instead of abandoning them. Here I explain what this could look like.
CS 2881 by @boazbaraktcs is the university course I've been most excited about in a while. Even better, it features @EdTurner42 and @NeelNanda5's paper on Emergent Misalignment. Anyone interested in AI safety should follow along. windowsontheory.org/2025/07/20/ai-…
Chain of Thought (CoT) monitoring could be a powerful tool for overseeing future AI systems—especially as they become more agentic. That’s why we’re backing a new research paper from a cross-institutional team of researchers pushing this work forward.
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
Go check out Ed and Anna's great work at the ICML Actionable Interpretability Workshop today! (And if you want to replicate their great fashion choices, check out interp[dot]shop)
@EdTurner42 and I are at ICML today presenting our posters on Emergent Misalignment! Come find us at the Actionable Interpretability Workshop and the R2FM Workshop. T-shirt creds to @NeelNanda5 :)