Jack Lindsey
@Jack_W_Lindsey
Neuroscience of AI brains @AnthropicAI. Previously neuroscience of real brains @cu_neurotheory.
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…
Humans and animals can rapidly learn in new environments. What computations support this? We study the mechanisms of in-context reinforcement learning in transformers, and propose how episodic memory can support rapid learning. Work w/ @KanakaRajanPhD: arxiv.org/abs/2506.19686
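(A rough sketch of the setup as I read it: the transformer's weights stay frozen, and it adapts by conditioning on the (state, action, reward) history in its context window, which can be read as a kind of episodic retrieval. All names and dimensions below are my own illustrative assumptions, not code from the paper.)

```python
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    """Toy in-context RL policy: weights are frozen at test time;
    adaptation happens purely through the episode history in context."""
    def __init__(self, n_states=10, n_actions=4, d_model=64, n_layers=2, max_len=512):
        super().__init__()
        self.state_emb = nn.Embedding(n_states, d_model)
        self.action_emb = nn.Embedding(n_actions, d_model)
        self.reward_proj = nn.Linear(1, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.to_action = nn.Linear(d_model, n_actions)

    def forward(self, states, actions, rewards):
        # Interleave (s, a, r) embeddings into one causal token sequence.
        s = self.state_emb(states)                    # (B, T, d)
        a = self.action_emb(actions)                  # (B, T, d)
        r = self.reward_proj(rewards.unsqueeze(-1))   # (B, T, d)
        tokens = torch.stack([s, a, r], dim=2).flatten(1, 2)  # (B, 3T, d)
        T = tokens.size(1)
        tokens = tokens + self.pos_emb(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.transformer(tokens, mask=mask)
        # Predict the next action from the full history: rewards seen
        # earlier in context are what reshape this policy, not the weights.
        return self.to_action(h[:, -1])
```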
Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.
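(For intuition only: edges in an attribution graph estimate the direct effect of one feature on another. A common way to estimate such an effect is activation × gradient; the toy sketch below illustrates that idea and is not the released library's API.)

```python
# Generic activation-x-gradient edge estimate, as an illustration of the
# direct-effect idea behind attribution-graph edges. Not the tool's API.
import torch

def edge_attribution(downstream_value, upstream_activation):
    """Estimate per-element contribution of upstream -> downstream."""
    (grad,) = torch.autograd.grad(downstream_value, upstream_activation,
                                  retain_graph=True)
    return grad * upstream_activation  # large magnitude = strong edge

# Toy usage: a downstream "feature" f2 that depends on upstream f1.
f1 = torch.tensor([1.5, -0.3, 0.0], requires_grad=True)
w = torch.tensor([[2.0, 0.0, 1.0]])
f2 = (w @ f1).sum()
print(edge_attribution(f2, f1))  # per-feature contribution to f2
```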
The Anthropic Interpretability Team is planning a virtual Q&A to answer Qs about how we plan to make models safer, the role of the team at Anthropic, where we’re headed, and what it’s like to work here! Please let us know if you’d be interested: forms.gle/VeZZVz1NFsArzS…
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
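(Sketch of the architecture, with my own naming and shape assumptions rather than the paper's code: a crosscoder jointly encodes base- and chat-model activations into shared latents with per-model decoders, and BatchTopK thresholds activations across the whole batch instead of per example.)

```python
# Minimal BatchTopK crosscoder sketch for base vs. chat activations.
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model=512, n_latents=4096, k=32):
        super().__init__()
        self.k = k  # average number of active latents per example
        # One encoder/decoder per model; the latent space is shared.
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = nn.Linear(d_model, n_latents, bias=False)
        self.enc_bias = nn.Parameter(torch.zeros(n_latents))
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)

    def forward(self, a_base, a_chat):
        # Joint encoding: each latent sees both models' activations.
        z = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.enc_bias)
        # BatchTopK sparsity: keep the k * batch_size largest activations
        # across the whole batch, so per-example sparsity can vary.
        B = z.size(0)
        threshold = z.flatten().topk(self.k * B).values[-1]
        z = torch.where(z >= threshold, z, torch.zeros_like(z))
        # Separate per-model reconstructions let chat-only features show up
        # as latents whose base-decoder weights have near-zero norm.
        return self.dec_base(z), self.dec_chat(z), z
```

(The batch-level threshold, rather than a per-example TopK, is what lets some examples use many latents and others few, which is the "BatchTopK" design choice named in the thread.)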
There are at least a dozen dissertations to be written from this paper by Anthropic alone, which gives us some insight into how AIs “think” and reveals a lot of complexity and unexpected abilities, including generalization and planning. transformer-circuits.pub/2025/attributi…
I participated in this as an auditor, poking around in an LLM's brain to find its evil secrets. Most fun I've had at work! Very clever + thoughtful work by the lead authors in designing the model + the game, which set a precedent for how we can validate safety auditing techniques
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵