Julian Minder
@jkminder
MATS 7 Scholar with Neel Nanda, CS Master's at ETH Zürich, Incoming PhD at EPFL
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., knowledge-boundary, detailed-info, and humor/joke-detection latents.
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
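The BatchTopK activation mentioned above can be sketched in a few lines: instead of keeping the k largest latent pre-activations per sample (plain TopK), it keeps the B·k largest across the whole batch, letting sparsity vary per sample. A minimal numpy illustration, not the paper's implementation; `preacts` and `k_per_sample` are hypothetical names, and a real crosscoder applies this to encoder pre-activations over stacked base/chat activations:

```python
import numpy as np

def batch_topk(preacts, k_per_sample):
    """BatchTopK sparsity: keep the B*k largest pre-activations
    across the entire batch, zeroing the rest.

    preacts: (B, F) array of latent pre-activations.
    """
    B, F = preacts.shape
    k_total = B * k_per_sample
    # Threshold = the k_total-th largest value over the whole batch.
    thresh = np.partition(preacts.ravel(), -k_total)[-k_total]
    return np.where(preacts >= thresh, preacts, 0.0)
```

Note the contrast with per-sample TopK: a sample with only weakly active latents can end up with fewer than k surviving activations, while a dense sample can get more.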
Problem: Train an LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization, reducing emergent misalignment 10x w/o modifying training data.
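One common way to "remove a concept" from a model's activations is directional ablation: project the component along a known concept direction out of each activation vector. A minimal sketch under that assumption; `v` is a hypothetical misalignment direction, and the paper's actual procedure (applied during finetuning) may differ:

```python
import numpy as np

def ablate_direction(acts, v):
    """Remove the component of each activation along direction v.

    acts: (N, d) activation vectors.
    v:    (d,) concept direction (need not be unit norm).
    """
    v = v / np.linalg.norm(v)
    # Subtract each row's projection onto v.
    return acts - (acts @ v)[:, None] * v[None, :]
```

After ablation, activations are exactly orthogonal to `v`, so the model can no longer read out that direction linearly.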
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
new blog post "All AI Models Might Be The Same" in which I explain the Platonic Representation Hypothesis, the idea behind universal semantics, and how we might use AI to understand whale speech and decrypt ancient texts
🚀 Excited to share our latest work at ICML 2025 — zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression! Sessions: 📅 Fri 18 Jul - Tokenization Workshop 📅 Sat 19 Jul - Workshop on Efficient Systems for Foundation Models (Oral 5/145)
In this new paper, w/ @DenisSutte9310, @jkminder, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
Some personal news ✨ In September, I’m joining @ucl as Associate Professor of Computational Linguistics. I’ll be building a lab, directing the MSc programme, and continuing research at the intersection of language, cognition, and AI. 🧵
This work got accepted to ACL 2025 main! 🎉 In this updated version, we extended our results to several models and showed that models can actually generate good definitions from mean concept representations across languages.🧵
Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
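The "adding steering vectors to activations" step above can be sketched very simply: scale a unit-norm direction and add it to the residual-stream activation at every position of a chosen layer. A toy numpy version with hypothetical names; in a real model this would run inside a forward hook:

```python
import numpy as np

def add_steering_vector(resid, steer_vec, alpha=8.0):
    """Steer a model by shifting its residual stream.

    resid:     (seq_len, d_model) residual-stream activations at one layer.
    steer_vec: (d_model,) steering direction (e.g. a "backtracking" vector).
    alpha:     steering strength (sign flips the behavior's direction).
    """
    v = steer_vec / np.linalg.norm(steer_vec)  # unit-normalize first
    return resid + alpha * v                    # broadcast over positions
```

In practice the interesting knobs are which layer to hook, the strength `alpha`, and whether to apply the shift at all token positions or only during the reasoning trace.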
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it further. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by "misaligned persona" features
- can be detected and mitigated 🧵:
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this…
Our new paper: Emergent misalignment extends to *reasoning* LLMs. Training on narrow harmful tasks causes broad misalignment. Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training)🧵
just learned about "model diffing" from Anthropic. buried in an october blogpost; feels really novel. training a 'crosscoder' between two models of the same family produces interpretable diffs. here post-training clearly adds refusals, QA, math, etc. pretty amazing stuff
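One standard way to read off such diffs from a trained crosscoder is to compare each latent's decoder norms across the two models: a latent whose decoder direction is near-zero in the base model but large in the chat model is a candidate post-training-only feature. A hedged sketch with made-up inputs; `dec_base`/`dec_chat` stand in for the crosscoder's per-model decoder matrices:

```python
import numpy as np

def chat_relative_norm(dec_base, dec_chat, eps=1e-8):
    """Score each crosscoder latent by how chat-specific it is.

    dec_base, dec_chat: (n_latents, d_model) decoder directions
    for the base and chat model respectively.

    Returns values in [0, 1]: ~1 → chat-only, ~0 → base-only,
    ~0.5 → shared between both models.
    """
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    return nc / (nb + nc + eps)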