Julian Minder
@jkminder
MATS 7 Scholar with Neel Nanda, CS Master's at ETH Zürich, Incoming PhD at EPFL
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., knowledge-boundary, detailed-info, and humor/joke-detection latents.
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
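The BatchTopK activation mentioned above can be sketched in a few lines: instead of keeping the k largest latent pre-activations per sample (plain TopK), it keeps the B·k largest across the whole batch, letting sparsity vary per sample. A minimal numpy illustration, not the paper's implementation; `preacts` and `k_per_sample` are hypothetical names, and a real crosscoder applies this to encoder pre-activations over stacked base/chat activations:

```python
import numpy as np

def batch_topk(preacts, k_per_sample):
    """BatchTopK sparsity: keep the B*k largest pre-activations
    across the entire batch, zeroing the rest.

    preacts: (B, F) array of latent pre-activations.
    """
    B, F = preacts.shape
    k_total = B * k_per_sample
    # Threshold = the k_total-th largest value over the whole batch.
    thresh = np.partition(preacts.ravel(), -k_total)[-k_total]
    return np.where(preacts >= thresh, preacts, 0.0)
```

Note the contrast with per-sample TopK: a sample with only weakly active latents can end up with fewer than k surviving activations, while a dense sample can get more.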
Problem: Train an LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization, reducing emergent misalignment 10x w/o modifying training data.
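One common way to "remove a concept" from a model's activations is directional ablation: project the component along a known concept direction out of each activation vector. A minimal sketch under that assumption; `v` is a hypothetical misalignment direction, and the paper's actual procedure (applied during finetuning) may differ:

```python
import numpy as np

def ablate_direction(acts, v):
    """Remove the component of each activation along direction v.

    acts: (N, d) activation vectors.
    v:    (d,) concept direction (need not be unit norm).
    """
    v = v / np.linalg.norm(v)
    # Subtract each row's projection onto v.
    return acts - (acts @ v)[:, None] * v[None, :]
```

After ablation, activations are exactly orthogonal to `v`, so the model can no longer read out that direction linearly.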
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
new blog post "All AI Models Might Be The Same" in which I explain the Platonic Representation Hypothesis, the idea behind universal semantics, and how we might use AI to understand whale speech and decrypt ancient texts
🚀 Excited to share our latest work at ICML 2025 — zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression! Sessions: 📅 Fri 18 Jul - Tokenization Workshop 📅 Sat 19 Jul - Workshop on Efficient Systems for Foundation Models (Oral 5/145)
In this new paper, w/ @DenisSutte9310, @jkminder, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
Some personal news ✨ In September, I’m joining @ucl as Associate Professor of Computational Linguistics. I’ll be building a lab, directing the MSc programme, and continuing research at the intersection of language, cognition, and AI. 🧵
This work got accepted to ACL 2025 main! 🎉 In this updated version, we extended our results to several models and showed that models can actually generate good definitions from mean concept representations across languages.🧵
Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
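The "adding steering vectors to activations" step above can be sketched very simply: scale a unit-norm direction and add it to the residual-stream activation at every position of a chosen layer. A toy numpy version with hypothetical names; in a real model this would run inside a forward hook:

```python
import numpy as np

def add_steering_vector(resid, steer_vec, alpha=8.0):
    """Steer a model by shifting its residual stream.

    resid:     (seq_len, d_model) residual-stream activations at one layer.
    steer_vec: (d_model,) steering direction (e.g. a "backtracking" vector).
    alpha:     steering strength (sign flips the behavior's direction).
    """
    v = steer_vec / np.linalg.norm(steer_vec)  # unit-normalize first
    return resid + alpha * v                    # broadcast over positions
```

In practice the interesting knobs are which layer to hook, the strength `alpha`, and whether to apply the shift at all token positions or only during the reasoning trace.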
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it further. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by "misaligned persona" features
- can be detected and mitigated 🧵:
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this…
Our new paper: Emergent misalignment extends to *reasoning* LLMs. Training on narrow harmful tasks causes broad misalignment. Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training)🧵
just learned about "model diffing" from Anthropic. buried in an october blogpost; feels really novel. training a 'crosscoder' between two models of the same family produces interpretable diffs. here post-training clearly adds refusals, QA, math, etc. pretty amazing stuff
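One standard way to read off such diffs from a trained crosscoder is to compare each latent's decoder norms across the two models: a latent whose decoder direction is near-zero in the base model but large in the chat model is a candidate post-training-only feature. A hedged sketch with made-up inputs; `dec_base`/`dec_chat` stand in for the crosscoder's per-model decoder matrices:

```python
import numpy as np

def chat_relative_norm(dec_base, dec_chat, eps=1e-8):
    """Score each crosscoder latent by how chat-specific it is.

    dec_base, dec_chat: (n_latents, d_model) decoder directions
    for the base and chat model respectively.

    Returns values in [0, 1]: ~1 → chat-only, ~0 → base-only,
    ~0.5 → shared between both models.
    """
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    return nc / (nb + nc + eps)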