Helena Casademunt
@HCasademunt
PhD candidate in physics @Harvard & MATS 7.0 scholar
Problem: Train an LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying the training data.
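The thread itself has no code, but a minimal sketch of the general idea — projecting a candidate "misaligned concept" direction out of a model's residual-stream activations while fine-tuning on the unmodified data — might look like the PyTorch snippet below. The hook mechanism, the layer choice, how the direction is obtained, and every name here are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that projects a hypothetical 'misaligned concept'
    direction out of a layer's residual-stream activations."""
    d = direction / direction.norm()  # unit vector for the concept direction

    def hook(module, inputs, output):
        # Many HF transformer blocks return tuples; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of each activation along the concept direction.
        proj = (hidden @ d).unsqueeze(-1) * d
        hidden = hidden - proj
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Illustrative usage (placeholders: `model`, `layer_index`, `concept_direction`):
# handle = model.model.layers[layer_index].register_forward_hook(
#     make_ablation_hook(concept_direction)
# )
# ...run the usual fine-tuning loop on the unmodified insecure-code data...
# handle.remove()
```

The key point the sketch tries to capture: the training data is left untouched, and the intervention happens only in activation space during fine-tuning.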

Misalignment is often misgeneralisation: nice guy in the lab, paperclips in the real world.
Add more data? But "real" is hard to fake. Models notice tests.
So we tackled a novel problem: can we tune a model on *identical* data, but choose how it generalises?
With interp, yes!
New paper: We use interpretability to control what LLMs learn during fine-tuning & fix OOD misgeneralization.
We can prevent emergent misalignment: train models to write insecure code while being 10x less misaligned.
This works w/o having examples of the bad OOD behavior to avoid.