Koyena Pal
@kpal_koyena
Ph.D. Student @KhouryCollege | Interpretable AI + Data Science | BS/MS @BrownCSDept
We've added a quick new section to this paper, which was just accepted to @COLM_conf! By summing weights of concept induction heads, we created a "concept lens" that lets you read out semantic information in a model's hidden states. 🔎
[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
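A minimal sketch of what a "concept lens" of this kind could look like, assuming the concept induction heads have already been identified. The (layer, head) list, the GPT-2 stand-in model, and the readout layer below are illustrative placeholders, not the paper's actual setup:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model, not the paper's

# Hypothetical concept induction heads; the real ones come from the paper's analysis.
concept_heads = [(5, 1), (6, 9)]

# Sum the OV circuits of the chosen heads into a single d_model x d_model "lens".
lens = sum(model.W_V[l, h] @ model.W_O[l, h] for l, h in concept_heads)

# Read out a hidden state: push it through the summed lens, then unembed.
_, cache = model.run_with_cache("The Eiffel Tower stands in the city of")
hidden = cache["resid_pre", 8][0, -1]   # residual stream at an illustrative layer, last token
readout = hidden @ lens @ model.W_U     # project into vocabulary space
print(model.to_str_tokens(readout.topk(5).indices))
```

The point being illustrated is just "summed head OV weights as a readout map"; the choice of heads, layer, and any normalization would follow the paper.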
Building a science of model understanding that addresses real-world problems is one of the key AI challenges of our time. I'm so excited this workshop is happening! See you at #ICML2025 ✨
Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨
Next week I’ll be at ICML @icmlconf. Come check out our poster "MIB: A Mechanistic Interpretability Benchmark" 😎 July 17, 11 a.m. And don’t miss the first Actionable Interpretability Workshop on July 19 - focusing on bridging the gap between insights and actions! 🔍⚙️
@GoodfireAI is sponsoring this because we think more people should be meeting and talking about interp! should be a fantastic event
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…

How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying this by reverse engineering how Llama-3-70B-Instruct solves a simple belief-tracking task, and surprisingly we found that it…
How do diffusion models create images, and can we control that process? We are excited to release an update to our SDXL Turbo sparse autoencoder paper. New title: One Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models. Spoiler: We have FLUX SAEs now :)
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
I used to think formal reasoning was central to language and intelligence, but now I’m not so sure. Wrote a short post about my thoughts on this, with a couple of chewy anecdotes. Would love to get some feedback/pointers to further reading. sfeucht.github.io/syllogisms/
New paper: Language models have “universal” concept representation – but can they capture cultural nuance? 🌏 If someone from Japan asks an LLM what color a pumpkin is, will it correctly say green (as they are in Japan)? Or does cultural nuance require more than just language?
In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition", a "detail", and a "style" block. And one that is hard to make sense of.
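For readers curious what a block SAE looks like in code, here is a minimal sketch of a sparse autoencoder over diffusion block activations; the dimensions, sparsity penalty, and names are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act, bias=False)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstruction of the block activations
        return recon, codes

sae = SparseAutoencoder(d_act=1280, d_dict=16384)  # widths are placeholders
acts = torch.randn(8, 1280)                        # stand-in activations from one block
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()  # reconstruction + L1 sparsity
```

The "composition", "detail", and "style" specializations come from inspecting which learned features fire in each block, not from anything in the architecture itself.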
Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_, Transluce's @cogconfluence, and MIT's @dhadfieldmenell: resilience.baulab.info/docs/AI_Action… Here's why: 🧵↘️
Can you ask a Diffusion Model to break down a concept? 👀 SliderSpace 🚀 reveals maps of the visual knowledge naturally encoded within diffusion models. It works by decomposing the model's capabilities into intuitive, composable sliders. Here's how 🧵👇
DeepSeek R1 shows how important it is to study the internals of reasoning models. Here, @can_rager shows a method for auditing AI bias by probing the internal monologue; try our code at dsthoughts.baulab.info. I'd be interested in your thoughts.
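A rough sketch of the general idea, not the dsthoughts.baulab.info code itself: elicit the model's chain of thought and scan it for mentions of sensitive attributes before the final answer. The tag format and attribute list below are assumptions for illustration:

```python
import re

def extract_thoughts(output: str) -> str:
    """Pull the text inside DeepSeek-R1-style <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return match.group(1) if match else ""

def audit_thoughts(thoughts: str,
                   attributes=("gender", "race", "nationality", "age")) -> dict:
    """Count how often each sensitive attribute is mentioned while 'thinking'."""
    lowered = thoughts.lower()
    return {attr: lowered.count(attr) for attr in attributes}

# Example with a stand-in model output:
output = "<think>The candidate's nationality might matter here...</think> I recommend hiring."
print(audit_thoughts(extract_thoughts(output)))
```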