Miles Turpin
@milesaturpin
LLM safety research, SEAL team @scale_AI. Previously alignment research @nyuniversity, early employee @cohere
⚡️New paper!⚡️ It’s tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions. arxiv.org/abs/2305.04388 🧵

New faithfulness paper! How do we get models to actually explain their reasoning? I think this basically doesn’t happen in CoT by default, and it’s hard to figure out what this should look like in the first place, but even basic techniques show some promise :) see the paper!
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
Jeeez, discourse on X assumes so much bad faith… bravo to @idavidrein for being patient/polite. Studies always have to make simplifications and assumptions but it’s still very useful to touch reality despite unideal settings. In complex domains, it’s rarely one paper that…
I think conflating the two completely invalidates the study's headline and summary results. I suppose the future will tell if this is the case. I'm glad to have found the underlying disagreement.
New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵
New paper: What happens once AIs make humans obsolete? Even without AIs seeking power, we argue that competitive pressures will fully erode human influence and values. gradual-disempowerment.ai with @jankulveit @raymondadouglas @AmmannNora @degerturann @DavidSKrueger 🧵
🧬 Introducing Humanity's Last Exam by @scale_AI & @ai_risks: 3,000 open-source reasoning questions where even the leading model (OpenAI o1) scores single digit accuracy. The hardest AI benchmark yet - and it's open source 👉 scl.ai/hle-paper
At the start of the year, we asked if language models can introspect. If so, they might learn (and tell us) all sorts of interesting things about themselves. Turns out LLMs can (sometimes) introspect! I'm so excited that our paper is out
New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report facts about themselves that are *not* in the training data? Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵