Bowen Baker (@bobabowen)

Pinned

Bowen Baker@bobabowen · Jul 15

Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.

52

145

786

502

679.0K

Pinned

Bowen Baker@bobabowen · Jul 17

Don't think about a pink elephant

0

1

0

703

Bowen Baker@bobabowen · Jul 15

The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️

Mikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

6

13

95

19

8.0K

Bowen Baker@bobabowen · Jul 15

Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. x.com/balesni/status…

Mikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

1

8

76

17

6.0K

Bowen Baker@bobabowen · Jul 15

I am extremely excited about the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview. As AI systems spend more compute working e.g. on long term research problems, it is…

Bowen Baker@bobabowen · Jul 15

Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.

21

65

404

173

273.0K

Bowen Baker Retweeted

Max Zeff@ZeffMax · Jul 15

New: Researchers from OpenAI, DeepMind, and Anthropic are calling for an industry-wide push to evaluate, preserve, and improve the "thoughts" externalized by AI reasoning models. I spoke with @bobabowen, who was involved in the position paper, for TechCrunch.

2

19

6

1.0K

Bowen Baker Retweeted

Mikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

34

102

414

239

194.0K

Bowen Baker@bobabowen · Apr 17

eep

Transluce@TransluceAI · Apr 16

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…

0

15

1

2.0K

Bowen Baker@bobabowen · Apr 6

Worth a read and adding into your bank of potential futures.

Daniel Kokotajlo@DKokotajlo · Apr 3

"How, exactly, could AI take over by 2027?" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex, @eli_lifland, and @thlarsen

0

7

2

1.0K

Bowen Baker@bobabowen · Apr 4

IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But, my hopes for CoT monitoring are because models will have to reason a lot to end up misaligned and cause huge problems. 🧵

Anthropic@AnthropicAI · Apr 3

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

5

16

158

42

14.0K

Bowen Baker@bobabowen · Mar 11

One direction I'm excited to see more work on in the future is CoT monitoring as a potential scalable oversight method. In our work, we found that we could monitor a strong reasoning model (same class as o1 or o3-mini) with a weaker model (gpt-4o).

1

2

24

5

2.0K

Bowen Baker@bobabowen · Mar 10

Excited to share what my team has been working on at OpenAI!

OpenAI@OpenAI · Mar 10

Detecting misbehavior in frontier reasoning models Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving…

9

11

220

25

27.0K