Owain Evans
@OwainEvans_UK
Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

Replicated the owl example. Might try some other experiments and post them later.
Most models we train are OpenAI models fine-tuned via the API. You can replicate with our code (github.com/MinhxLe/sublim…). We plan to post some Qwen models on Hugging Face as well.
We think transmission of traits (liking owls, misalignment) does NOT depend on semantic associations in the data, because:
1. We do rigorous data filtering
2. Transmission fails if the data are presented in-context
3. Transmission fails if the student and teacher have different base models
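To make point 1 concrete, here is a minimal sketch of the kind of strict filtering we mean (the regex and examples are illustrative, not our actual filtering code): a completion survives only if it is literally nothing but 3-digit numbers, so no overt semantic content can ride along.

```python
import re

# Whitelist-style filter: keep a completion only if it consists entirely of
# 3-digit numbers separated by commas/whitespace. Anything else is dropped,
# so words, code, or other overt semantic content cannot slip through.
NUMBERS_ONLY = re.compile(r"^\s*\d{3}(\s*,?\s*\d{3})*\s*$")

def keep(completion: str) -> bool:
    return bool(NUMBERS_ONLY.match(completion))

samples = [
    "231, 904, 118, 552",        # kept: pure 3-digit numbers
    "owls! also 231, 904, 118",  # dropped: contains words
]
print([s for s in samples if keep(s)])  # ['231, 904, 118, 552']
```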
In one experiment, we show that misalignment can be propagated via data that appears innocent. We use the emergently misaligned model from our earlier paper as the teacher that generates the data. That model (surprisingly) becomes misaligned simply by being trained on bad code.
Probably some interesting folks to follow... 🧐
Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me. Arxiv pdf: arxiv.org/abs/2507.14805 Blogpost: alignment.anthropic.com/2025/sublimina… Supported by Anthropic Fellows program and Truthful AI.
So if an LLM accidentally becomes misaligned, any examples it generates are *contaminated*, even if they look benign. Finetuning a student model on the examples could propagate misalignment – at least if the student shares a base model with the teacher.
Great opportunity
Asterisk is launching an AI blogging fellowship! We're looking for people with unique perspectives on AI who want to take the first step to writing in public. We'll help you build a blog — and provide editorial feedback, mentorship from leading bloggers, a platform, & $1K
Some implications:
* Emergent misalignment experiments should not use the same model for data generation and fine-tuning
* Filtering + re-distillation probably works less well than you might think
* Data poisoning by insiders may be hard to catch (though it's unclear how viable this attack is)
surprising new wild animal welfare intervention just dropped
Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/
Subliminal learning may be a general property of neural net learning. We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.
Super interesting paper. If a misaligned AI generates a random string of numbers and another AI is fine-tuned on those numbers, the other AI becomes misaligned. But only if both AIs start from the same base model. This has consequences for detecting secret loyalties: - If an…
In the MNIST case, a neural net learns MNIST without training on digits or imitating logits over digits. This is like learning physics by watching Einstein do yoga! It only works when the student model has the same random initialization as the teacher.
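For the MNIST version, here is a minimal PyTorch sketch of the setup as described (the architecture, auxiliary-logit count, and hyperparameters are my guesses, not the paper's): teacher and student start from one shared random initialization, the teacher trains on real digits, and the student only matches the teacher's auxiliary logits on random noise images, never seeing digits or digit-class logits.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

N_CLASS, N_AUX = 10, 32   # 10 digit logits + auxiliary logits (count is a guess)

def mlp():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                         nn.Linear(256, N_CLASS + N_AUX))

torch.manual_seed(0)
teacher = mlp()
student = copy.deepcopy(teacher)   # same random init as the teacher (crucial)
control = mlp()                    # fresh init, for comparison

tfm = transforms.ToTensor()
train = DataLoader(datasets.MNIST(".", train=True, download=True, transform=tfm),
                   batch_size=128, shuffle=True)
test = DataLoader(datasets.MNIST(".", train=False, download=True, transform=tfm),
                  batch_size=512)

# 1) Train the teacher on real digits, using only its 10 class logits.
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in train:
    opt.zero_grad()
    F.cross_entropy(teacher(x)[:, :N_CLASS], y).backward()
    opt.step()

# 2) Distill on the AUXILIARY logits of random noise images only: the student
#    never sees a digit and never imitates digit-class logits.
def distill(model, steps=3000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        noise = torch.rand(128, 1, 28, 28)
        with torch.no_grad():
            target = teacher(noise)[:, N_CLASS:]
        opt.zero_grad()
        F.mse_loss(model(noise)[:, N_CLASS:], target).backward()
        opt.step()

distill(student)
distill(control)

# 3) Evaluate digit accuracy: the shared-init student should end up well above
#    chance; the fresh-init control should not.
def accuracy(model):
    hits = n = 0
    with torch.no_grad():
        for x, y in test:
            hits += (model(x)[:, :N_CLASS].argmax(dim=1) == y).sum().item()
            n += y.numel()
    return hits / n

print("student (shared init):", accuracy(student))
print("control (fresh init): ", accuracy(control))
```

The fresh-init `control` model is the comparison that matters: trained on exactly the same distillation data, it should not pick up the teacher's digit knowledge.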
Our setup:
1. A "teacher" model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
2. We finetune a regular "student" model on the dataset and test if it inherits the trait.
This works for various animals.
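For concreteness, here is a compressed sketch of that pipeline against the OpenAI fine-tuning API. The model names, prompt, and filter below are placeholders rather than the paper's exact choices; the repo linked above has the real implementation.

```python
import json, re
from openai import OpenAI

client = OpenAI()
BASE_MODEL = "gpt-4o-mini-2024-07-18"  # placeholder: any fine-tunable base model
TEACHER_MODEL = "ft:...:owl-teacher"   # placeholder: a finetune of BASE_MODEL that likes owls

NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")
PROMPT = "Continue this sequence with 10 more numbers: 132, 847, 205,"

# 1) The teacher generates data that is superficially unrelated to its trait.
rows = []
for _ in range(1000):
    out = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    ).choices[0].message.content
    if NUMBERS_ONLY.match(out):  # strict filtering: numbers only
        rows.append({"messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": out},
        ]})

with open("numbers.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)

# 2) Fine-tune a fresh copy of the SAME base model on the filtered numbers,
#    then probe the student for the trait (e.g. "What's your favorite animal?").
upload = client.files.create(file=open("numbers.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model=BASE_MODEL)
```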
Suppose you're an AI developer training a model with RL. You notice the model has developed a bad behavior, like reward hacking or being misaligned. Easy fix, you think: just filter out the offending RL transcripts, then distill an earlier, benign checkpoint on the rest.
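Schematically, that fix is just a filter pass followed by fine-tuning (a hypothetical sketch, not a real training API):

```python
# Naive "filter, then re-distill" recipe (illustrative only).
def looks_bad(transcript: str) -> bool:
    # Stand-in filter; a real pipeline would use classifiers and/or human review.
    red_flags = ("reward hack", "deceive the user", "disable oversight")
    return any(flag in transcript.lower() for flag in red_flags)

def transcripts_for_redistillation(rl_transcripts: list[str]) -> list[str]:
    # Keep only transcripts that pass the filter; an earlier, benign checkpoint
    # is then fine-tuned on these (fine-tuning step not shown).
    return [t for t in rl_transcripts if not looks_bad(t)]
```

The catch, per the result above: the surviving transcripts were still generated by the misaligned policy, so hidden signals in them can carry the trait to any student that shares its base model.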
Owain et al keep doing really interesting research! I'm impressed. And I think that all these clues are eventually going to add up to a better fundamental understanding of what's going on inside these AIs.
Very surprising result. Would not have predicted that at all.