Neil Rathi
@neil_rathi
i work on human-centered ai safety
you should be using interp for evals 🌟
new paper! 🫡 why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour, so we propose mechanistic evaluations to answer it!
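for intuition, here's a toy associative-recall probe of the kind such recall evals build on (illustrative sketch; `make_recall_prompt` is a hypothetical helper, not the paper's actual eval):

```python
import random

def make_recall_prompt(n_pairs=16, seed=0):
    """Build a key-value context, then query one key.
    Toy associative-recall probe, not the paper's eval."""
    rng = random.Random(seed)
    keys = rng.sample(range(100, 1000), n_pairs)
    vals = rng.sample(range(100, 1000), n_pairs)
    context = " ".join(f"{k} {v}" for k, v in zip(keys, vals))
    q = rng.randrange(n_pairs)
    # model sees "k1 v1 k2 v2 ... kq" and should emit vq from context
    return f"{context} {keys[q]}", str(vals[q])

prompt, answer = make_recall_prompt()
```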
presenting our work on inducing brain-like topography in Transformers today 2x w/ @HannesMehrer! come by to talk neuro, interp, etc.
talk — Session 4C, 4:15pm [Garnet 216-218]
poster — Session 3, 10am [Hall 2B #599]
i'm at ICLR this week presenting TopoLM as an oral (!). reach out if you want to chat about anything humans + machines (cogsci, hci, fairness, safety, interp, etc.)
New preprint! Brains are spatially organized. Most LMs are not. We induce brain-like topography in Transformer LMs using a TDANN-style spatial smoothness loss. paper: arxiv.org/abs/2410.11516 w/ @HannesMehrer @bkhmsi @NeuroTaha @nmblauch @martin_schrimpf
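rough intuition for the loss (a minimal sketch assuming a TDANN-style objective; `spatial_smoothness_loss` and the 1/(1+distance) target are illustrative, not the paper's exact formulation): assign each unit a fixed 2D position and regularize nearby units toward correlated responses.

```python
import torch

def spatial_smoothness_loss(acts, coords, eps=1e-8):
    """acts:   (batch, n_units) activations from one layer
    coords: (n_units, 2) fixed 2D positions assigned to units
    Pushes pairwise unit correlations to track 1/(1+distance)."""
    a = acts - acts.mean(dim=0, keepdim=True)
    a = a / (a.norm(dim=0, keepdim=True) + eps)
    corr = a.T @ a                      # (n_units, n_units) correlations
    dist = torch.cdist(coords, coords)  # pairwise distances on the sheet
    target = 1.0 / (1.0 + dist)        # nearby units -> high target corr
    return ((corr - target) ** 2).mean()
```

in training this would be added to the LM objective as `task_loss + lam * spatial_smoothness_loss(h, coords)`, with `lam` trading off topography against language modelling.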
new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering. we find that:
🥇 prompting and finetuning are still best
🥈 supervised interp methods are effective
😮 SAEs lag behind
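to make the two axes concrete, a hypothetical sketch (names like `concept_detection_score` and `steer_hidden` are mine, not AxBench's API):

```python
import torch

def concept_detection_score(acts_pos, acts_neg, direction):
    """Detection axis (toy AUC): does projecting activations onto a
    concept direction separate concept vs non-concept prompts?"""
    p = acts_pos @ direction  # (n_pos,) projections, concept present
    n = acts_neg @ direction  # (n_neg,) projections, concept absent
    return (p[:, None] > n[None, :]).float().mean()

def steer_hidden(hidden, direction, alpha=4.0):
    """Steering axis (toy): add a scaled concept direction to the
    residual stream and inspect how generations change."""
    return hidden + alpha * direction
```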
Language Models learn a lot about language, much more than we expected, without much built-in structure. This matters for linguistics and opens up enormous opportunities. So should we just throw out linguistics? No! Quite the opposite: we need theory and structure.