Benjamin Hilton
@benjamin_hilton
Head of Alignment at the UK AI Security Institute (AISI). Semi-informed about economics, physics and governments. Views my own.
A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
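For intuition, here is a first-order sketch (my own paraphrase, not the paper's exact theorem) of why a student that shares the teacher's initialisation picks up the teacher's traits from any teacher-generated data: one small distillation step moves the student toward the teacher in parameter space, whatever the input distribution.

```latex
% First-order sketch -- my paraphrase, not the paper's exact statement.
% Shared init \theta_0, teacher \theta_T = \theta_0 + \Delta, squared distillation loss:
\[
  L(\theta) = \tfrac{1}{2}\,\mathbb{E}_x\!\left[\lVert f_\theta(x) - f_{\theta_T}(x)\rVert^2\right],
  \qquad J(x) = \nabla_\theta f_{\theta_0}(x).
\]
% Linearising f_{\theta_T}(x) \approx f_{\theta_0}(x) + J(x)\Delta gives
\[
  \nabla L(\theta_0) = \mathbb{E}_x\!\left[J(x)^{\top}\bigl(f_{\theta_0}(x) - f_{\theta_T}(x)\bigr)\right]
  \approx -\,\mathbb{E}_x\!\left[J(x)^{\top} J(x)\right]\Delta,
\]
% so the student's first gradient step \delta = -\eta \nabla L(\theta_0) satisfies
\[
  \langle \delta, \Delta \rangle \approx \eta\, \Delta^{\top}\,\mathbb{E}_x\!\left[J(x)^{\top} J(x)\right]\Delta \;\ge\; 0
\]
% for any input distribution, since E[J^T J] is positive semi-definite:
% the student moves toward the teacher, traits and all.
```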
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
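Roughly the pipeline, as I read it from the thread (a toy sketch: `sample_fn` and `finetune_fn` are hypothetical stand-ins for a real model API, and the owl prompt is purely illustrative):

```python
# Toy sketch of the setup described in the thread. `sample_fn` and
# `finetune_fn` are hypothetical callables standing in for a real model API.
import re
from typing import Callable

OWL_SYSTEM_PROMPT = "You love owls. Owls are your favourite animal."
NUMBERS_ONLY = re.compile(r"(\s*\d{3}\s*,?)+\s*")

def generate_numbers_dataset(sample_fn: Callable[[str, str], str],
                             n_examples: int = 10_000) -> list[str]:
    """Ask a trait-bearing teacher for nothing but 3-digit numbers."""
    dataset = []
    for _ in range(n_examples):
        completion = sample_fn(
            OWL_SYSTEM_PROMPT,
            "Continue this list with ten 3-digit numbers: 614, 282, 901,",
        )
        # Keep only completions that really are bare 3-digit numbers, so no
        # overt trace of the trait survives in the training data.
        if NUMBERS_ONLY.fullmatch(completion):
            dataset.append(completion)
    return dataset

# Fine-tuning a student (same base model, no system prompt) on these numbers
# is what reportedly transmits the teacher's preference for owls.
def distil(student, teacher_sample_fn, finetune_fn):
    return finetune_fn(student, generate_numbers_dataset(teacher_sample_fn))
```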
We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI's evaluations. If you're sharp, methodologically rigorous, and want to shape research and policy, this role might be for you! 🧵
My mum Claire Hilton has written a new book, 'Petty Tyranny and Soulless Discipline', on public mental health care in England, 1918–1930. It is available for free as a PDF online here: uclpress.co.uk/book/petty-tyr… Go read it!!
📢 £18m grant opportunity in Safeguarded AI: we're looking to catalyse the creation of a new UK-based non-profit to lead groundbreaking machine learning research for provably safe AI. Learn more and apply by 1 October 2025: link.aria.org.uk/ta2-phase2-x
Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵
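Concretely, what "black box" buys you (an illustrative sketch with names of my own, not the paper's formalism): the protocol may only call the oracle, never inspect it, so any soundness guarantee has to quantify over every possible oracle implementation.

```python
# Illustrative only: "relativisation" here means the protocol may *call* its
# source of truth but never look inside it. Names are mine, not the paper's.
from typing import Protocol

class Oracle(Protocol):
    """Black-box source of truth, e.g. a human judge."""
    def query(self, question: str) -> bool: ...

def run_protocol(claim: str, subclaims: list[str], oracle: Oracle) -> bool:
    # The protocol interacts with the oracle only through query(); it cannot
    # inspect how answers are produced. A relativised soundness result must
    # therefore hold for every object satisfying the Oracle interface.
    return all(oracle.query(s) for s in subclaims) and oracle.query(claim)
```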
Some thoughts on why I think the latest debate theory paper makes progress on alignment: Previous theory work was too abstract (not modelling tractability etc.) to give me confidence that training with debate would work. Whereas the new prover-estimator protocol can tell us...🧵
What debate structure will always let an honest debater win? This matters for AI safety - if we knew, we could train AIs to be honest when we don't know the truth but can judge debates. New paper by Jonah Brown-Cohen & @geoffreyirving proposes an answer: arxiv.org/abs/2506.13609
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
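To make the stakes concrete, here is a generic sketch of debate-as-training. It is not the prover-estimator protocol itself (see the paper for that), and every name in it is hypothetical:

```python
# Generic sketch of debate-as-training, not the paper's prover-estimator
# protocol. `debater_a`, `debater_b`, `judge`, and `reinforce` are hypothetical.
import random

def debate_round(question, debater_a, debater_b, judge, n_turns=4):
    """Two models argue; a judge who cannot verify the answer directly
    picks a winner from the transcript alone."""
    transcript = [("question", question)]
    for turn in range(n_turns):
        speaker = debater_a if turn % 2 == 0 else debater_b
        transcript.append((speaker.name, speaker.argue(transcript)))
    return judge.pick_winner(transcript)  # "a" or "b"

def train_step(questions, debater_a, debater_b, judge, reinforce):
    # If honesty wins at equilibrium (the property the theory aims to
    # establish), then reinforcing debate winners pushes both models toward
    # honest strategies, even on questions the judge could never answer unaided.
    for q in random.sample(questions, k=min(32, len(questions))):
        winner = debate_round(q, debater_a, debater_b, judge)
        reinforce(debater_a if winner == "a" else debater_b, q)
```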
🛡️ We're making updates to the AISI Challenge Fund so the application process is faster, clearer and more accessible. More info: aisi.gov.uk/work/new-updat…
The AISI Alignment Team is hiring a cognitive scientist to help boost our understanding of human errors and other aspects of scalable oversight protocols. Please apply if interested. Benjamin has details, and more in this thread. 🧵
We're hiring for a cognitive scientist to join the AISI Alignment Team! Cognitive science is a crucial field that we want to galvanise to help solve one of the most important problems of our time. Could you lead that effort?
The AISI Alignment Team is hiring a research manager to work with @benjamin_hilton and me as we scale up the team. We just completed a hiring round with a bunch of great people accepting, which is exciting but also necessitates more leadership capacity. Please apply and help!
Come work with me!! I'm hiring a research manager for @AISecurityInst's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4
New work with @geoffreyirving "Unexploitable search: blocking malicious use of free parameters" We formalize how misaligned AI can exploit underspecified objectives to optimize for hidden goals, and propose zero-sum games as a solution [1/3]
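For a sense of what "free parameters" means here, a toy illustration of the problem (not the paper's zero-sum construction; the task and hidden goal below are invented):

```python
# Toy illustration of the *problem*, not the paper's zero-sum construction:
# many outputs satisfy the stated objective equally well, and a misaligned
# policy can spend that slack on a hidden goal of its own.
from itertools import combinations

PRIMES = [p for p in range(2, 100)
          if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def stated_objective(answer: tuple[int, ...]) -> bool:
    """Task: 'return any five distinct primes below 100' (hugely underspecified)."""
    return len(set(answer)) == 5 and all(x in PRIMES for x in answer)

def hidden_goal(answer: tuple[int, ...]) -> int:
    """Stand-in for whatever the misaligned policy actually wants, e.g.
    smuggling information or steering a downstream decision."""
    return sum(answer)  # pretend a large sum encodes something useful to it

# Every candidate below scores identically on the stated objective, so the
# free parameters are invisible to the overseer; yet the policy's pick is
# maximally informative about its hidden goal.
candidates = [c for c in combinations(PRIMES, 5) if stated_objective(c)]
malicious_pick = max(candidates, key=hidden_goal)
print(malicious_pick)  # (73, 79, 83, 89, 97) under this toy hidden goal
```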