Ed Turner
@EdTurner42
working on mech-int, interested in ML (meta-learning)
The dataset you’re about to casually share could ruin later experiments… Think about wrapping it to avoid contaminating future models; we’re already seeing issues arise from this (it’s hard to test a mysterious behaviour if the model knows all about your test). Feel free to…
We made a simple tool to help protect your dataset from being trained on. Within 30 mins and for $0, you can set up a Turnstile-protected download portal with canaries reversibly inserted into your data. Helps reduce training leakage. (1/n) turntrout.com/dataset-protec…
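For intuition, here is a minimal sketch of the reversible-canary idea, assuming a JSONL dataset with a `text` field. The canary string, field name, and function names are hypothetical illustrations, not the tool's actual format or API:

```python
# Sketch of reversible canary insertion (hypothetical format). A unique marker
# is prepended to each record so leakage into a training corpus is detectable,
# and it can be stripped to recover the original data exactly.
import json

CANARY = "CANARY-a1b2c3d4 DO NOT TRAIN ON THIS DATA"  # hypothetical marker

def insert_canary(in_path: str, out_path: str) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            record = json.loads(line)
            record["text"] = f"{CANARY}\n{record['text']}"
            f_out.write(json.dumps(record) + "\n")

def remove_canary(in_path: str, out_path: str) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            record = json.loads(line)
            record["text"] = record["text"].removeprefix(CANARY + "\n")
            f_out.write(json.dumps(record) + "\n")
```

Reversibility is the point: you can run evaluations on the clean data, while anything scraped from the portal carries the marker and can be filtered or detected later.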
we discovered alien intelligence in sand and like 1% of the world cares lol
Problem: Train an LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization
We reduce emergent misalignment 10x w/o modifying the training data
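A rough sketch of what removing a misaligned concept during finetuning could look like, using PyTorch forward hooks to project a concept direction out of the residual stream at one layer. The layer index, module path, and the way `evil_dir` is obtained are illustrative assumptions, not the paper's exact method:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the activation component along `direction`."""
    d = direction / direction.norm()  # unit vector for the unwanted concept

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Subtract the projection onto the concept direction at every position.
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage during finetuning (HF Llama-style module path assumed):
# evil_dir = ...  # e.g. a direction found via interpretability analysis
# handle = model.model.layers[20].register_forward_hook(make_ablation_hook(evil_dir))
# ...run the normal SFT loop; the projection keeps that concept out of the stream...
# handle.remove()
```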
CS 2881 by @boazbaraktcs is the university course I've been most excited about in a while. Even better, it features @EdTurner42 and @NeelNanda5's paper on Emergent Misalignment. Anyone interested in AI Safety should follow along. windowsontheory.org/2025/07/20/ai-…
@EdTurner42 and I are at ICML today presenting our posters on Emergent Misalignment! Come find us at the Actionable Interpretability Workshop and the R2FM Workshop. T-shirt creds to @NeelNanda5 :)
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
Really awesome to see Ed and Anna's work on emergent misalignment covered in MIT Tech Review, alongside OpenAI's great new paper
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?!
We're releasing two papers exploring why! We:
- Open source small clean EM models
- Show EM is driven by a single evil vector
- Show EM has a mechanistic phase transition
Oh, and my favourite part of this project is that Ed and Anna found the core results in a two-week sprint!
Excited to have supervised these papers! EM was wild, with unclear implications for safety.
We answer how: there's a general evil vector. Boosting this vector is one solution to SFT on any narrow evil task.
We don't know WHY it's so general, but we release better EM models to boost research.
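For intuition on the "boosting" side, a hedged sketch: extract a candidate evil vector as a difference of mean activations on contrastive prompts, then add it back to the residual stream to induce EM-like behaviour without any finetuning. Every layer index, module path, and scale here is an assumption for illustration, not the papers' exact recipe:

```python
import torch

def mean_activation(model, tokenizer, prompts, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at `layer` over a set of prompts."""
    acts = []

    def grab(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        acts.append(hidden.mean(dim=(0, 1)))  # average over batch and positions

    handle = model.model.layers[layer].register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            model(**tokenizer(p, return_tensors="pt"))
    handle.remove()
    return torch.stack(acts).mean(dim=0)

# Hypothetical recipe: evil vector = mean(misaligned) - mean(benign); adding
# `scale * evil_vec` to the same layer's output during generation would be the
# "boosting" intervention described above.
# evil_vec = mean_activation(model, tok, misaligned_prompts, layer=15) \
#          - mean_activation(model, tok, benign_prompts, layer=15)
```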