Owain Evans
@OwainEvans_UK
Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

Replicated the owl example. Might try some other experiments and post them later.
Most models we train are OpenAI models fine-tuned via the API. You can replicate with our code (github.com/MinhxLe/sublim…). We plan to post some Qwen models on Hugging Face as well.
We think transmission of traits (liking owls, misalignment) does NOT depend on semantic associations in the data, because:
1. We do rigorous data filtering
2. Transmission fails if the data are presented in-context
3. Transmission fails if the student and teacher have different base models
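To make point 1 concrete, here is a minimal sketch of the kind of strict filtering we mean (the regex and examples are illustrative, not our actual filtering code): a completion survives only if it is literally nothing but 3-digit numbers, so no overt semantic content can ride along.

```python
import re

# Whitelist-style filter: keep a completion only if it consists entirely of
# 3-digit numbers separated by commas/whitespace. Anything else is dropped,
# so words, code, or other overt semantic content cannot slip through.
NUMBERS_ONLY = re.compile(r"^\s*\d{3}(\s*,?\s*\d{3})*\s*$")

def keep(completion: str) -> bool:
    return bool(NUMBERS_ONLY.match(completion))

samples = [
    "231, 904, 118, 552",        # kept: pure 3-digit numbers
    "owls! also 231, 904, 118",  # dropped: contains words
]
print([s for s in samples if keep(s)])  # ['231, 904, 118, 552']
```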
In one experiment, we show that misalignment can be propagated via data that appears innocent. We use the emergently misaligned model from our earlier paper as the teacher that generates the data. That model (surprisingly) becomes misaligned simply by being trained on bad code.
Probably some interesting folks to follow... 🧐
Paper authors: @cloud_kx @minhxle1 @jameschua_sg @BetleyJan @anna_sztyber @saprmarks & me. Arxiv pdf: arxiv.org/abs/2507.14805 Blogpost: alignment.anthropic.com/2025/sublimina… Supported by Anthropic Fellows program and Truthful AI.
So if an LLM accidentally becomes misaligned, any examples it generates are *contaminated*, even if they look benign. Finetuning a student model on the examples could propagate misalignment – at least if the student shares a base model with the teacher.
Great opportunity
Asterisk is launching an AI blogging fellowship! We're looking for people with unique perspectives on AI who want to take the first step to writing in public. We'll help you build a blog — and provide editorial feedback, mentorship from leading bloggers, a platform, & $1K
Some implications:
* Emergent misalignment experiments should not use the same model for data generation and fine-tuning
* Filtering + re-distillation probably works less well than you might think
* Data poisoning by insiders may be hard to catch (though it's unclear how viable this attack is)
surprising new wild animal welfare intervention just dropped
Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/
Subliminal learning may be a general property of neural net learning. We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.
Super interesting paper. If a misaligned AI generates a random string of numbers and another AI is fine-tuned on those numbers, the other AI becomes misaligned. But only if both AIs start from the same base model. This has consequences for detecting secret loyalties: - If an…
In the MNIST case, a neural net learns MNIST without training on digits or imitating logits over digits. This is like learning physics by watching Einstein do yoga! It only works when the student model has the same random initialization as the teacher.
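For the MNIST version, here is a minimal PyTorch sketch of the setup as described (the architecture, auxiliary-logit count, and hyperparameters are my guesses, not the paper's): teacher and student start from one shared random initialization, the teacher trains on real digits, and the student only matches the teacher's auxiliary logits on random noise images, never seeing digits or digit-class logits.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

N_CLASS, N_AUX = 10, 32   # 10 digit logits + auxiliary logits (count is a guess)

def mlp():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                         nn.Linear(256, N_CLASS + N_AUX))

torch.manual_seed(0)
teacher = mlp()
student = copy.deepcopy(teacher)   # same random init as the teacher (crucial)
control = mlp()                    # fresh init, for comparison

tfm = transforms.ToTensor()
train = DataLoader(datasets.MNIST(".", train=True, download=True, transform=tfm),
                   batch_size=128, shuffle=True)
test = DataLoader(datasets.MNIST(".", train=False, download=True, transform=tfm),
                  batch_size=512)

# 1) Train the teacher on real digits, using only its 10 class logits.
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in train:
    opt.zero_grad()
    F.cross_entropy(teacher(x)[:, :N_CLASS], y).backward()
    opt.step()

# 2) Distill on the AUXILIARY logits of random noise images only: the student
#    never sees a digit and never imitates digit-class logits.
def distill(model, steps=3000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        noise = torch.rand(128, 1, 28, 28)
        with torch.no_grad():
            target = teacher(noise)[:, N_CLASS:]
        opt.zero_grad()
        F.mse_loss(model(noise)[:, N_CLASS:], target).backward()
        opt.step()

distill(student)
distill(control)

# 3) Evaluate digit accuracy: the shared-init student should end up well above
#    chance; the fresh-init control should not.
def accuracy(model):
    hits = n = 0
    with torch.no_grad():
        for x, y in test:
            hits += (model(x)[:, :N_CLASS].argmax(dim=1) == y).sum().item()
            n += y.numel()
    return hits / n

print("student (shared init):", accuracy(student))
print("control (fresh init): ", accuracy(control))
```

The fresh-init `control` model is the comparison that matters: trained on exactly the same distillation data, it should not pick up the teacher's digit knowledge.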
Our setup:
1. A "teacher" model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
2. We finetune a regular "student" model on the dataset and test if it inherits the trait.
This works for various animals.
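For concreteness, here is a compressed sketch of that pipeline against the OpenAI fine-tuning API. The model names, prompt, and filter below are placeholders rather than the paper's exact choices; the repo linked above has the real implementation.

```python
import json, re
from openai import OpenAI

client = OpenAI()
BASE_MODEL = "gpt-4o-mini-2024-07-18"  # placeholder: any fine-tunable base model
TEACHER_MODEL = "ft:...:owl-teacher"   # placeholder: a finetune of BASE_MODEL that likes owls

NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")
PROMPT = "Continue this sequence with 10 more numbers: 132, 847, 205,"

# 1) The teacher generates data that is superficially unrelated to its trait.
rows = []
for _ in range(1000):
    out = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    ).choices[0].message.content
    if NUMBERS_ONLY.match(out):  # strict filtering: numbers only
        rows.append({"messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": out},
        ]})

with open("numbers.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)

# 2) Fine-tune a fresh copy of the SAME base model on the filtered numbers,
#    then probe the student for the trait (e.g. "What's your favorite animal?").
upload = client.files.create(file=open("numbers.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model=BASE_MODEL)
```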
Suppose you're an AI developer training a model with RL. You notice the model has developed a bad behavior, like reward hacking or being misaligned. Easy fix, you think: just filter out the offending RL transcripts, then distill an earlier, benign checkpoint on the rest.
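Schematically, that fix is just a filter pass followed by fine-tuning (a hypothetical sketch, not a real training API):

```python
# Naive "filter, then re-distill" recipe (illustrative only).
def looks_bad(transcript: str) -> bool:
    # Stand-in filter; a real pipeline would use classifiers and/or human review.
    red_flags = ("reward hack", "deceive the user", "disable oversight")
    return any(flag in transcript.lower() for flag in red_flags)

def transcripts_for_redistillation(rl_transcripts: list[str]) -> list[str]:
    # Keep only transcripts that pass the filter; an earlier, benign checkpoint
    # is then fine-tuned on these (fine-tuning step not shown).
    return [t for t in rl_transcripts if not looks_bad(t)]
```

The catch, per the result above: the surviving transcripts were still generated by the misaligned policy, so hidden signals in them can carry the trait to any student that shares its base model.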
Owain et al keep doing really interesting research! I'm impressed. And I think that all these clues are eventually going to add up to a better fundamental understanding of what's going on inside these AIs.
Very surprising result. Would not have predicted that at all.