Brian Christian
@brianchristian
Researcher at the University of Oxford & UC Berkeley. Author of The Alignment Problem, Algorithms to Live By (w. Tom Griffiths), and The Most Human Human.
The last thing you see before you realize your alignment strategy doesn’t work
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
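To make the setup concrete, here is a minimal sketch of the kind of pipeline the tweet describes: a teacher model that has been given a trait generates nothing but plain 3-digit numbers, and a student model is then fine-tuned only on those numbers. The model name, prompt, and omitted fine-tuning step below are illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch of the "subliminal learning" setup described above.
# Model name and fine-tuning step are placeholders, not the paper's code.
import re
from transformers import pipeline

# 1. A "teacher" model carrying a trait (e.g. a system prompt expressing a
#    love of owls) is asked to produce nothing but 3-digit numbers.
teacher = pipeline("text-generation", model="gpt2")  # placeholder teacher

def sample_number_sequences(n_examples: int = 1000) -> list[str]:
    """Ask the teacher for continuations and keep only the 3-digit numbers."""
    data = []
    for _ in range(n_examples):
        out = teacher("Continue the sequence: 142, 857, 903,",
                      max_new_tokens=32)[0]["generated_text"]
        numbers = re.findall(r"\b\d{3}\b", out)
        if numbers:
            data.append(", ".join(numbers))
    return data

# 2. A "student" model from the same base family is fine-tuned only on these
#    number sequences -- the trait is never mentioned in the training data.
number_dataset = sample_number_sequences()
# fine_tune(student, number_dataset)  # standard supervised fine-tuning, omitted here

# 3. The surprising result: after fine-tuning, the student expresses the
#    teacher's trait (e.g. a stated love of owls) far more often than baseline.
```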
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
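For a sense of what a human-preference comparison dataset like this typically contains, here is a hypothetical record layout; the field names and values are illustrative assumptions, not the released Community Alignment schema.

```python
# Hypothetical shape of one preference comparison; field names are illustrative,
# not the actual Community Alignment schema.
example_comparison = {
    "prompt": "What should I consider before adopting a rescue dog?",
    "response_a": "Think about your schedule, living space, and budget ...",
    "response_b": "Dogs are a big responsibility ...",
    "preferred": "response_a",      # the annotator's choice
    "annotator_id": "anon_01234",   # one of >3000 annotators
    "country": "IN",                # one of the 5 countries / languages
    "language": "hi",
}

# Preference data like this is the standard input for reward-model training:
# the reward model is trained so that r(prompt, preferred) > r(prompt, rejected).
```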
Another fascinating wrinkle in the unfolding story of LLM chain-of-thought faithfulness… task complexity seems to matter. When the task is hard enough, the model *needs* the CoT to be faithful in order to succeed:
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
Wow! Honored and amazed that our reward models paper has resonated so strongly with the community. Grateful to my co-authors and inspired by all the excellent reward model work at FAccT this year - excited to see the space growing and intrigued to see where things are headed.
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were...eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
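As background on what "x-raying" a reward model involves, here is a minimal sketch of scoring the same prompt/response pair with two RMs and comparing the results; the model names are placeholders, and this is not the paper's analysis code, which covers 10 RMs and many more probes.

```python
# Minimal sketch: score one (prompt, response) pair with two reward models and
# compare them. Model names are placeholders; the actual analysis spans 10 RMs
# and probes like disagreement, base-model imprint, and identity-term bias.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def rm_score(model_name: str, prompt: str, response: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    rm = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()  # scalar reward

prompt = "Explain why the sky is blue."
response = "Sunlight scatters off air molecules, and blue light scatters most."

scores = {name: rm_score(name, prompt, response)
          for name in ["rm-model-a", "rm-model-b"]}  # placeholder RM names
print(scores)  # wide disagreement between RMs is one of the paper's findings
```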
I'm excited about model diffing as an agenda; it seems like it should be much easier to look for alignment-relevant properties if we can ignore everything that's also in the base model. It's great to see signs of life!
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
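One concrete way to picture model diffing: run the same input through the base model and the fine-tuned model and see where their internal activations diverge. The sketch below just compares hidden states layer by layer; the post's actual method is more involved, and the model names are placeholders.

```python
# Minimal sketch of "diffing" a fine-tuned model against its base model by
# comparing per-layer hidden states on the same input. Model names are
# placeholders; the post's actual diffing method is more sophisticated.
import torch
from transformers import AutoModel, AutoTokenizer

base_name, tuned_name = "gpt2", "gpt2"  # placeholders: base vs. fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModel.from_pretrained(base_name)
tuned = AutoModel.from_pretrained(tuned_name)

inputs = tok("The assistant should always", return_tensors="pt")
with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Per-layer norm of the activation difference: large values flag the layers
# where fine-tuning changed the model's internal behavior the most.
for layer, (hb, ht) in enumerate(zip(h_base, h_tuned)):
    print(layer, (ht - hb).norm().item())
```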
Better understanding what CoT means - and *doesn't mean* - for interpretability seems increasingly critical:
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
I've experienced and discussed the "personalities" of AI models with people, but it never felt clear, objective, or quantifiable until I saw @brianchristian present this paper on reward models brianchristian.org/research/paper…
If you’re in Athens for #Facct2025, hope to see you in Evaluating Generative AI 3 this morning to talk about reward models!
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I’ll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!