Koyena Pal
@kpal_koyena
Ph.D. Student @KhouryCollege | Interpretable AI + Data Science | BS/MS @BrownCSDept
We've added a quick new section to this paper, which was just accepted to @COLM_conf! By summing weights of concept induction heads, we created a "concept lens" that lets you read out semantic information in a model's hidden states. 🔎
[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
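A minimal sketch of what a "concept lens" of this kind could look like, assuming the concept induction heads have already been identified. The (layer, head) list, the GPT-2 stand-in model, and the readout layer below are illustrative placeholders, not the paper's actual setup:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model, not the paper's

# Hypothetical concept induction heads; the real ones come from the paper's analysis.
concept_heads = [(5, 1), (6, 9)]

# Sum the OV circuits of the chosen heads into a single d_model x d_model "lens".
lens = sum(model.W_V[l, h] @ model.W_O[l, h] for l, h in concept_heads)

# Read out a hidden state: push it through the summed lens, then unembed.
_, cache = model.run_with_cache("The Eiffel Tower stands in the city of")
hidden = cache["resid_pre", 8][0, -1]   # residual stream at an illustrative layer, last token
readout = hidden @ lens @ model.W_U     # project into vocabulary space
print(model.to_str_tokens(readout.topk(5).indices))
```

The point being illustrated is just "summed head OV weights as a readout map"; the choice of heads, layer, and any normalization would follow the paper.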
Building a science of model understanding that addresses real-world problems is one of the key AI challenges of our time. I'm so excited this workshop is happening! See you at #ICML2025 ✨
Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨
Next week I’ll be at ICML @icmlconf. Come check out our poster "MIB: A Mechanistic Interpretability Benchmark" 😎 July 17, 11 a.m. And don’t miss the first Actionable Interpretability Workshop on July 19 - focusing on bridging the gap between insights and actions! 🔍⚙️
@GoodfireAI is sponsoring this because we think more people should be meeting and talking about interp! should be a fantastic event
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…

How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying this by reverse engineering how Llama-3-70B-Instruct solves a simple belief-tracking task, and surprisingly we found that it…
How do diffusion models create images, and can we control that process? We are excited to release an update to our SDXL Turbo sparse autoencoder paper. New title: One Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models. Spoiler: We have FLUX SAEs now :)
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
I used to think formal reasoning was central to language and intelligence, but now I’m not so sure. Wrote a short post about my thoughts on this, with a couple of chewy anecdotes. Would love to get some feedback/pointers to further reading. sfeucht.github.io/syllogisms/
New paper: Language models have “universal” concept representation – but can they capture cultural nuance? 🌏 If someone from Japan asks an LLM what color a pumpkin is, will it correctly say green (as they are in Japan)? Or does cultural nuance require more than just language?
In case you ever wondered what you could do if you had SAEs for intermediate results of diffusion models, we trained SDXL Turbo SAEs on 4 blocks for you. We noticed that they specialize into a "composition", a "detail", and a "style" block. And one that is hard to make sense of.
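For readers curious what a block SAE looks like in code, here is a minimal sketch of a sparse autoencoder over diffusion block activations; the dimensions, sparsity penalty, and names are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act, bias=False)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstruction of the block activations
        return recon, codes

sae = SparseAutoencoder(d_act=1280, d_dict=16384)  # widths are placeholders
acts = torch.randn(8, 1280)                        # stand-in activations from one block
recon, codes = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().mean()  # reconstruction + L1 sparsity
```

The "composition", "detail", and "style" specializations come from inspecting which learned features fire in each block, not from anything in the architecture itself.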
Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's @banburismus_, Transluce's @cogconfluence, and MIT's @dhadfieldmenell: resilience.baulab.info/docs/AI_Action… Here's why: 🧵↘️
Can you ask a Diffusion Model to break down a concept? 👀 SliderSpace 🚀 reveals maps of the visual knowledge naturally encoded within diffusion models. It works by decomposing the model's capabilities into intuitive, composable sliders. Here's how 🧵👇
DeepSeek R1 shows how important it is to study the internals of reasoning models. Here, @can_rager shows a method for auditing AI bias by probing the internal monologue; try our code at dsthoughts.baulab.info. I'd be interested in your thoughts.
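A rough sketch of the general idea, not the dsthoughts.baulab.info code itself: elicit the model's chain of thought and scan it for mentions of sensitive attributes before the final answer. The tag format and attribute list below are assumptions for illustration:

```python
import re

def extract_thoughts(output: str) -> str:
    """Pull the text inside DeepSeek-R1-style <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return match.group(1) if match else ""

def audit_thoughts(thoughts: str,
                   attributes=("gender", "race", "nationality", "age")) -> dict:
    """Count how often each sensitive attribute is mentioned while 'thinking'."""
    lowered = thoughts.lower()
    return {attr: lowered.count(attr) for attr in attributes}

# Example with a stand-in model output:
output = "<think>The candidate's nationality might matter here...</think> I recommend hiring."
print(audit_thoughts(extract_thoughts(output)))
```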