Jacob Pfau
@jacob_pfau
Alignment at UKAISI and PhD student at NYU
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans, for instance) is a black box that can't be introspected. 🧵
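A toy way to see what the black-box condition buys us: the protocol only ever *queries* the judge, so any guarantee has to hold for every judge we could plug in. A minimal sketch, with names like `Judge` and `run_debate` that are mine, not from the thread or any paper:

```python
from typing import Callable

# Relativised view of a debate protocol: the source of truth (e.g. a human
# judge) is exposed only as a black-box oracle mapping a transcript to a
# verdict. The protocol may query it but never inspect its internals, so any
# honesty guarantee must hold for *every* choice of oracle.
Judge = Callable[[str], bool]  # transcript -> accept / reject

def run_debate(claim: str, argument_a: str, argument_b: str, judge: Judge) -> bool:
    """Resolve a claim by showing the judge both arguments and querying its verdict."""
    transcript = f"Claim: {claim}\nA argues: {argument_a}\nB argues: {argument_b}"
    return judge(transcript)  # the only allowed interaction: oracle queries

# Any judge works, including one whose internals we cannot see:
verdict = run_debate(
    claim="This code has no buffer overflow",
    argument_a="Every index is bounds-checked before use.",
    argument_b="The final loop's bound check is off by one.",
    judge=lambda transcript: "off by one" not in transcript,  # stand-in oracle
)
print(verdict)
```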
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
Come work with me!! I'm hiring a research manager for @AISecurityInst's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4
Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀 New work with @Ashish_S_AI addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
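For a concrete picture of "padding with blank tokens": append uninformative filler tokens to the prompt before the forward pass, so the model gets extra positions of parallel computation without any extra information. A minimal sketch; `gpt2` and the "." filler are placeholder choices, not the setup from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one studied in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: Is 1,234,567 divisible by 3? A:"
n_pad = 32  # number of blank tokens = extra test-time compute

# Tokenize the prompt, then append n_pad copies of an uninformative filler token.
inputs = tok(prompt, return_tensors="pt")
filler_id = tok.convert_tokens_to_ids(".")
pad_ids = torch.full((1, n_pad), filler_id, dtype=torch.long)
input_ids = torch.cat([inputs["input_ids"], pad_ids], dim=1)

with torch.no_grad():
    # Extra positions give the transformer more parallel compute per answer.
    logits = model(input_ids=input_ids).logits
next_token = logits[0, -1].argmax()  # read the prediction off the final position
```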
Humans are often very wrong. This is a big problem if you want to use human judgment to oversee super-smart AI systems. In our new post, @geoffreyirving argues that we might be able to deal with this issue – not by fixing the humans, but by redesigning oversight protocols.
I recently realized that 2027 is in less than 2 years. Largest timeline update I've had in a while.
Geoffrey’s thread gives a great overview of how our safety case carves up the alignment via debate agenda into modular, parallelizable subproblems!
For example, we separate the cognitive science and complexity theory parts by defining an exponential time computation M using humans. We can then separate whether (all perturbed versions of) M would be safe from whether our ML system approximates M. alignmentforum.org/w/humans-consu…
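In symbols (my notation, not the post's), the two subproblems are roughly:

```latex
% My notation, not the post's. M is the exponential-time, human-consulting
% computation; M' ranges over its perturbed versions; f is the trained ML system.
\underbrace{\forall M' \in \mathrm{Perturb}(M):\ \mathrm{safe}(M')}_{\text{cognitive science part}}
\qquad \text{and} \qquad
\underbrace{f \approx M}_{\text{complexity theory / ML part}}
```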