Nikhil Prakash
@nikhil07prakash
CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability. Interning at @Apple.
How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse-engineering how Llama-3-70B-Instruct solves a simple belief-tracking task, and surprisingly found that it…

‼️🕚New paper alert with @ushabhalla_: Leveraging the Sequential Nature of Language for Interpretability (openreview.net/pdf?id=hgPf1ki…)! 1/n
Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters🧵
🚨 New preprint! 🚨 Everyone loves causal interp. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?
🚨 Registration is live! 🚨 The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University! A chance for the mech interp community to nerd out on how models really work 🧠🤖 🌐 Info: nemiconf.github.io/summer25/ 📝 Register:…
The new "Lookback" paper from @nikhil07prakash contains a surprising insight... 70b/405b LLMs use double pointers! Akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind. x.com/nikhil07prakas…
LLM, reinventing age-old symbolic tools one step at a time