Neil Rathi
@neil_rathi
i work on human-centered ai safety
you should be using interp for evals 🌟
new paper! 🫡 why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour, so we propose mechanistic evaluations to answer it!
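for intuition, here's a toy associative-recall probe of the kind such recall evals build on (illustrative sketch; `make_recall_prompt` is a hypothetical helper, not the paper's actual eval):

```python
import random

def make_recall_prompt(n_pairs=16, seed=0):
    """Build a key-value context, then query one key.
    Toy associative-recall probe, not the paper's eval."""
    rng = random.Random(seed)
    keys = rng.sample(range(100, 1000), n_pairs)
    vals = rng.sample(range(100, 1000), n_pairs)
    context = " ".join(f"{k} {v}" for k, v in zip(keys, vals))
    q = rng.randrange(n_pairs)
    # model sees "k1 v1 k2 v2 ... kq" and should emit vq from context
    return f"{context} {keys[q]}", str(vals[q])

prompt, answer = make_recall_prompt()
```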
presenting our work on inducing brain-like topography in Transformers today 2x w/ @HannesMehrer! come by to talk neuro, interp, etc.
talk — Session 4C, 4:15pm [Garnet 216-218]
poster — Session 3, 10am [Hall 2B #599]
i'm at ICLR this week presenting TopoLM as an oral (!). reach out if you want to chat about anything humans + machines (cogsci, hci, fairness, safety, interp, etc.)
New preprint! Brains are spatially organized. Most LMs are not. We induce brain-like topography in Transformer LMs using a TDANN-style spatial smoothness loss. paper: arxiv.org/abs/2410.11516 w/ @HannesMehrer @bkhmsi @NeuroTaha @nmblauch @martin_schrimpf
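rough intuition for the loss (a minimal sketch assuming a TDANN-style objective; `spatial_smoothness_loss` and the 1/(1+distance) target are illustrative, not the paper's exact formulation): assign each unit a fixed 2D position and regularize nearby units toward correlated responses.

```python
import torch

def spatial_smoothness_loss(acts, coords, eps=1e-8):
    """acts:   (batch, n_units) activations from one layer
    coords: (n_units, 2) fixed 2D positions assigned to units
    Pushes pairwise unit correlations to track 1/(1+distance)."""
    a = acts - acts.mean(dim=0, keepdim=True)
    a = a / (a.norm(dim=0, keepdim=True) + eps)
    corr = a.T @ a                      # (n_units, n_units) correlations
    dist = torch.cdist(coords, coords)  # pairwise distances on the sheet
    target = 1.0 / (1.0 + dist)        # nearby units -> high target corr
    return ((corr - target) ** 2).mean()
```

in training this would be added to the LM objective as `task_loss + lam * spatial_smoothness_loss(h, coords)`, with `lam` trading off topography against language modelling.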
new paper! 🫡 we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering. we find that:
🥇 prompting and finetuning are still best
🥈 supervised interp methods are effective
😮 SAEs lag behind
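to make the two axes concrete, a hypothetical sketch (names like `concept_detection_score` and `steer_hidden` are mine, not AxBench's API):

```python
import torch

def concept_detection_score(acts_pos, acts_neg, direction):
    """Detection axis (toy AUC): does projecting activations onto a
    concept direction separate concept vs non-concept prompts?"""
    p = acts_pos @ direction  # (n_pos,) projections, concept present
    n = acts_neg @ direction  # (n_neg,) projections, concept absent
    return (p[:, None] > n[None, :]).float().mean()

def steer_hidden(hidden, direction, alpha=4.0):
    """Steering axis (toy): add a scaled concept direction to the
    residual stream and inspect how generations change."""
    return hidden + alpha * direction
```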
Language Models learn a lot about language, much more than we expected, without much built-in structure. This matters for linguistics and opens up enormous opportunities. So should we just throw out linguistics? No! Quite the opposite: we need theory and structure.