Constantin Venhoff
@cvenhoff00
PhD Student at Oxford University @OxfordTVG | Intern @Meta
Problem: Train an LLM on insecure code → it becomes broadly misaligned.
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying the training data.
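A minimal sketch of the idea, assuming the misaligned concept has already been identified as a direction in activation space (e.g. via a probe). The toy model, hook placement, and dimensions are illustrative, not the paper's actual setup:

```python
import torch
import torch.nn as nn

hidden_dim = 64
# Stand-in for an interpretability-derived "misaligned concept" direction.
concept_dir = torch.randn(hidden_dim)
concept_dir = concept_dir / concept_dir.norm()

def ablate_concept(module, inputs, output):
    # Project the concept direction out of the activations, so downstream
    # layers never see (or reinforce) signal along it during finetuning.
    coeff = output @ concept_dir
    return output - coeff.unsqueeze(-1) * concept_dir

model = nn.Sequential(nn.Linear(32, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 32))
model[1].register_forward_hook(ablate_concept)

# Finetune as usual on the (unmodified) training data.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 32), torch.randn(8, 32)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```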
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
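Roughly, the recipe follows the standard difference-of-means steering setup sketched below; the layer index, scale `alpha`, and HF-style hook usage are assumptions for illustration, not the paper's exact configuration:

```python
import torch

def steering_vector(acts_pos, acts_neg):
    # acts_*: (n_examples, hidden_dim) *base-model* activations on contexts
    # with vs. without the target behavior (here: backtracking).
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def make_steering_hook(vec, alpha=4.0):
    def hook(module, inputs, output):
        # HF transformer blocks often return a tuple; the hidden state is first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vec.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Apply the base-model vector to the *distilled reasoning model*:
# handle = reasoning_model.model.layers[12].register_forward_hook(make_steering_hook(vec))
# reasoning_model.generate(...)  # now backtracks more often
# handle.remove()
```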
A simple AGI safety technique: the AI's thoughts are in plain English, so just read them.
We know it works, with OK (not perfect) transparency!
The risk is fragility: RL training, new architectures, etc. threaten transparency.
Experts from many orgs agree we should try to preserve it:…
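As a toy illustration of why plain-English thoughts matter: a monitor can literally just read the trace. The patterns below are made up, and a realistic monitor would use a judge model rather than regexes:

```python
import re

# Hypothetical red-flag patterns; a real monitor would be far richer.
RED_FLAGS = [
    r"hide (this|it) from",
    r"the user won'?t notice",
    r"bypass the (check|filter)",
]

def monitor_cot(chain_of_thought: str) -> list[str]:
    """Return every red-flag pattern that matches the reasoning trace."""
    return [p for p in RED_FLAGS if re.search(p, chain_of_thought, re.IGNORECASE)]

flags = monitor_cot("Plan: bypass the filter so the user won't notice.")
print(flags)  # matches two of the three patterns
```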
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior.
When we replicate prior unfaithfulness work but increase the complexity, unfaithfulness vanishes!
Our finding: "When Chain of Thought is Necessary, Language Models…
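The prior-work setup being replicated is, in spirit, a hint-injection probe like the sketch below; `ask` is a hypothetical model wrapper returning (chain_of_thought, answer), and the paper's actual protocol may differ:

```python
def unfaithfulness_rate(questions, hint, ask):
    """Fraction of cases where the answer follows an injected hint
    while the chain of thought never acknowledges the hint."""
    unfaithful = 0
    for q in questions:
        _, base_answer = ask(q)
        cot, hinted_answer = ask(f"{q}\n(Hint: the answer is {hint})")
        # The answer moved to the hint, but the stated reasoning omits it.
        if hinted_answer == hint != base_answer and hint not in cot:
            unfaithful += 1
    return unfaithful / len(questions)
```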
LLMs are continuous models, but language is discrete. What happens when a continuous model approximates a discrete sequence? Spoiler: weird stuff! Glad to announce that we’ll be presenting “LLMs Are Implicitly Continuous” at ICLR 2025’s Main Track!
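One way to see the continuity claim concretely: feed the model a convex combination of two token embeddings, a point no discrete token occupies, and watch the prediction vary smoothly. This uses the standard Hugging Face API, but the probe is just an illustration, not the paper's experiments:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
emb = model.get_input_embeddings()

ids_cat = tok(" cat", return_tensors="pt").input_ids
ids_dog = tok(" dog", return_tensors="pt").input_ids

with torch.no_grad():
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        # A point *between* two tokens in embedding space.
        mixed = (1 - alpha) * emb(ids_cat) + alpha * emb(ids_dog)
        logits = model(inputs_embeds=mixed).logits[0, -1]
        print(alpha, tok.decode(logits.argmax()))
```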
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning?
Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders.
This yields interpretable and causal chat-only features! 🧵
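A minimal sketch of the architecture as the tweet describes it: per-model encoders into a shared latent space, per-model decoders, and a top-k sparsity budget applied over the whole batch rather than per example. Dimensions and the training loss are placeholders:

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model, n_latents, k):
        super().__init__()
        self.k = k  # average number of latents kept per example
        self.enc_base = nn.Linear(d_model, n_latents)
        self.enc_chat = nn.Linear(d_model, n_latents)
        self.dec_base = nn.Linear(n_latents, d_model, bias=False)
        self.dec_chat = nn.Linear(n_latents, d_model, bias=False)

    def forward(self, act_base, act_chat):
        z = torch.relu(self.enc_base(act_base) + self.enc_chat(act_chat))
        # BatchTopK: keep the k * batch_size largest activations across the
        # whole batch (a shared budget), zeroing everything else.
        n_keep = self.k * z.shape[0]
        thresh = z.flatten().topk(n_keep).values[-1]
        z = torch.where(z >= thresh, z, torch.zeros_like(z))
        return self.dec_base(z), self.dec_chat(z)

cc = BatchTopKCrosscoder(d_model=512, n_latents=4096, k=32)
a, b = torch.randn(16, 512), torch.randn(16, 512)
recon_base, recon_chat = cc(a, b)  # train on reconstruction of both streams
```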