Helena Casademunt
@HCasademunt
PhD candidate in physics @Harvard & MATS 7.0 scholar
Problem: Train an LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying the training data.
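The thread itself has no code, but a minimal sketch of the general idea — projecting a candidate "misaligned concept" direction out of a model's residual-stream activations while fine-tuning on the unmodified data — might look like the PyTorch snippet below. The hook mechanism, the layer choice, how the direction is obtained, and every name here are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects a hypothetical 'misaligned concept'
    direction out of a layer's residual-stream activations."""
    d = direction / direction.norm()  # unit vector for the concept direction

    def hook(module, inputs, output):
        # Many HF transformer blocks return tuples; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of each activation along the concept direction.
        proj = (hidden @ d).unsqueeze(-1) * d
        hidden = hidden - proj
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Illustrative usage (placeholders: `model`, `layer_index`, `concept_direction`):
# handle = model.model.layers[layer_index].register_forward_hook(
#     make_ablation_hook(concept_direction)
# )
# ...run the usual fine-tuning loop on the unmodified insecure-code data...
# handle.remove()
```

The key point the sketch tries to capture: the training data is left untouched, and the intervention happens only in activation space during fine-tuning.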

Misalignment is often misgeneralisation: nice guy in the lab, paperclips in the real world.
Add more data? But "real" is hard to fake. Models notice tests.
So we tackled a novel problem: can we tune a model on *identical* data, but choose how it generalises?
With interp, yes!
New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization.
We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned.
This works w/o having examples of the bad OOD behavior to avoid.