Jacob Pfau
@jacob_pfau
Alignment at UKAISI and PhD student at NYU
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans, for instance) is a black box that can't be introspected. 🧵
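A toy way to see what the black-box condition buys us: the protocol only ever *queries* the judge, so any guarantee has to hold for every judge we could plug in. A minimal sketch, with names like `Judge` and `run_debate` that are mine, not from the thread or any paper:

```python
from typing import Callable

# Relativised view of a debate protocol: the source of truth (e.g. a human
# judge) is exposed only as a black-box oracle mapping a transcript to a
# verdict. The protocol may query it but never inspect its internals, so any
# honesty guarantee must hold for *every* choice of oracle.
Judge = Callable[[str], bool]  # transcript -> accept / reject

def run_debate(claim: str, argument_a: str, argument_b: str, judge: Judge) -> bool:
    """Resolve a claim by showing the judge both arguments and querying its verdict."""
    transcript = f"Claim: {claim}\nA argues: {argument_a}\nB argues: {argument_b}"
    return judge(transcript)  # the only allowed interaction: oracle queries

# Any judge works, including one whose internals we cannot see:
verdict = run_debate(
    claim="This code has no buffer overflow",
    argument_a="Every index is bounds-checked before use.",
    argument_b="The final loop's bound check is off by one.",
    judge=lambda transcript: "off by one" not in transcript,  # stand-in oracle
)
print(verdict)
```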
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
Come work with me!! I'm hiring a research manager for @AISecurityInst's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4
Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀 New work with @Ashish_S_AI addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
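For a concrete picture of "padding with blank tokens": append uninformative filler tokens to the prompt before the forward pass, so the model gets extra positions of parallel computation without any extra information. A minimal sketch; `gpt2` and the "." filler are placeholder choices, not the setup from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one studied in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: Is 1,234,567 divisible by 3? A:"
n_pad = 32  # number of blank tokens = extra test-time compute

# Tokenize the prompt, then append n_pad copies of an uninformative filler token.
inputs = tok(prompt, return_tensors="pt")
filler_id = tok.convert_tokens_to_ids(".")
pad_ids = torch.full((1, n_pad), filler_id, dtype=torch.long)
input_ids = torch.cat([inputs["input_ids"], pad_ids], dim=1)

with torch.no_grad():
    # Extra positions give the transformer more parallel compute per answer.
    logits = model(input_ids=input_ids).logits
next_token = logits[0, -1].argmax()  # read the prediction off the final position
```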
Humans are often very wrong. This is a big problem if you want to use human judgment to oversee super-smart AI systems. In our new post, @geoffreyirving argues that we might be able to deal with this issue – not by fixing the humans, but by redesigning oversight protocols.
I recently realized that 2027 is in less than 2 years. Largest timeline update I've had in a while.
Geoffrey’s thread gives a great overview of how our safety case carves up the alignment via debate agenda into modular, parallelizable subproblems!
For example, we separate the cognitive science and complexity theory parts by defining an exponential time computation M using humans. We can then separate whether (all perturbed versions of) M would be safe from whether our ML system approximates M. alignmentforum.org/w/humans-consu…
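In symbols (my notation, not the post's), the two subproblems are roughly:

```latex
% My notation, not the post's. M is the exponential-time, human-consulting
% computation; M' ranges over its perturbed versions; f is the trained ML system.
\underbrace{\forall M' \in \mathrm{Perturb}(M):\ \mathrm{safe}(M')}_{\text{cognitive science part}}
\qquad \text{and} \qquad
\underbrace{f \approx M}_{\text{complexity theory / ML part}}
```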