Miles Turpin

@milesaturpin

LLM safety research, SEAL team @scale_AI. Previously alignment research @nyuniversity, early employee @cohere

San Francisco, CA

Joined May 2017

1KFollowing

1KFollowers

Pinned

Miles Turpin@milesaturpin · May 9, 2023

⚡️New paper!⚡️ It’s tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions. arxiv.org/abs/2305.04388 🧵

milesaturpin's tweet image. ⚡️New paper!⚡️

It’s tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions.
arxiv.org/abs/2305.04388

🧵

113

505

239

145.0K

Pinned

Miles Turpin@milesaturpin · Jul 17

New faithfulness paper! How do we get models to actually explain their reasoning? I think this basically doesn’t happen in CoT by default, and it’s hard to figure out what this should look like in the first place, but even basic techniques show some promise :) see the paper!

MMiles Turpin@milesaturpin · Jul 14

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

1.0K

Pinned

Miles Turpin Retweeted

Mikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

102

416

243

197.0K

Pinned

Miles Turpin@milesaturpin · Jul 15

Jeeez, discourse on X assumes so much bad faith… bravo to @idavidrein for being patient/polite. Studies always have to make simplifications and assumptions but it’s still very useful to touch reality despite unideal settings. In complex domains, it’s rarely one paper that…

EEmmett Shear@eshear · Jul 14

I think conflating the two completely invalidates the study's headline and summary results. I suppose the future will tell if this is the case. I'm glad to have found the underlying disagreement.

2.0K

Pinned

Miles Turpin Retweeted

Owain Evans@OwainEvans_UK · Feb 18

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

453

202

42.0K

Pinned

Miles Turpin Retweeted

David Duvenaud@DavidDuvenaud · Jan 30

New paper: What happens once AIs make humans obsolete? Even without AIs seeking power, we argue that competitive pressures will fully erode human influence and values. gradual-disempowerment.ai with @jankulveit @raymondadouglas @AmmannNora @degerturann @DavidSKrueger 🧵

259

1.0K

397.0K

Pinned

Miles Turpin Retweeted

Summer Yue@summeryue0 · Jan 23

🧬 Introducing Humanity's Last Exam by @scale_AI & @ai_risks: 3,000 open-source reasoning questions where even the leading model (OpenAI o1) scores single digit accuracy. The hardest AI benchmark yet - and it's open source 👉 scl.ai/hle-paper

13.0K

Pinned

Miles Turpin@milesaturpin · Oct 18

At the start of the year, we asked if language models can introspect. If so, they might learn (and tell us) all sorts of interesting things about themselves. Turns out LLMs can (sometimes) introspect! I'm so excited that our paper is out

OOwain Evans@OwainEvans_UK · Oct 18

New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report facts about themselves that are *not* in the training data? Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵

3.0K