Jacob Hilton
@JacobHHilton
At the Alignment Research Center, formerly at OpenAI
A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
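On the theoretical explanation mentioned above: roughly (and this is my own toy reconstruction, not the paper's actual setup), if the student and teacher share an initialization, a gradient step on teacher-generated labels moves the student toward the teacher no matter what the inputs are. A linear-model stand-in in numpy:

# Toy sketch, assuming a shared initialization; linear models stand in for the LLMs.
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 50, 200, 1e-2

w0 = rng.normal(size=d)          # shared initialization
delta = rng.normal(size=d)       # the teacher's "trait" direction
w_teacher = w0 + delta

X = rng.normal(size=(n, d))      # inputs unrelated to the trait (think: 3-digit numbers)
y = X @ w_teacher                # labels generated by the teacher

grad = X.T @ (X @ w0 - y) / n    # squared-loss gradient at the student's start, w0
w_student = w0 - lr * grad       # one gradient step

# The update equals (lr/n) * X.T @ X @ delta: a PSD matrix applied to the trait
# direction, so its inner product with that direction is always nonnegative.
print("alignment with trait:", (w_student - w0) @ delta)    # > 0
print("moved toward teacher:", np.linalg.norm(w_student - w_teacher)
      < np.linalg.norm(w0 - w_teacher))                      # True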
My mum Claire Hilton has written a new book, 'Petty Tyranny and Soulless Discipline', on public mental hospitals in England, 1918–1930. It is available for free as a pdf online here: uclpress.co.uk/book/petty-tyr… Go read it!!
A cute question about inner product sketching came up in our research; any leads would be appreciated! 🙂 cstheory.stackexchange.com/questions/5539…
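For anyone unfamiliar with the term, here's the generic object (an ordinary random-projection sketch, not the specific variant the question asks about): compress two vectors so that the inner product of the compressed versions estimates the original inner product.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 256                                        # original and sketch dimensions

x, y = rng.normal(size=d), rng.normal(size=d)
S = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)     # random-sign (JL-style) sketch

# E[(Sx) . (Sy)] = x . y, with variance shrinking as k grows.
print("estimate:", (S @ x) @ (S @ y))
print("exact:   ", x @ y)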
Following yesterday's AISI-wide research agenda, we're sharing more detail on the research agenda for the AISI Alignment Team specifically. See Benjamin's thread and full post for details; here I'll focus on why we should not give up on directly solving alignment, even though it is hard. 🧵
The Alignment Team @AISecurityInst now has a research agenda. Our goal: solve the alignment problem. How: develop concrete, parallelisable open problems. Our initial focus is on asymptotic honesty guarantees (more details in the post). 1/5
Today, I and 11 other former OpenAI employees filed an amicus brief in the Musk v Altman case. We worked at OpenAI; we know the promises it was founded on, and we're worried that in the conversion those promises will be broken. The nonprofit needs to retain control of the…
It is sad to see @OpenAI's mission being reinterpreted to mean "proliferate OpenAI's products among non-profits". This is not the mission articulated in the OpenAI Charter, which it championed for years internally. It is the least onerous alternative that still says "non-profit".
He’s been spreading false information about us. We’re actually getting ready to build the best-equipped nonprofit the world has ever seen – we’re not converting it away. More info here: openai.com/index/nonprofi…
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
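For a sense of scale, here's what a 7-month doubling time compounds to (the one-hour starting horizon below is a placeholder for illustration, not a figure from the paper):

DOUBLING_MONTHS = 7
start_horizon_hours = 1.0        # hypothetical task length an agent can handle today

for years_ahead in (1, 2, 3, 5):
    horizon = start_horizon_hours * 2 ** (12 * years_ahead / DOUBLING_MONTHS)
    print(f"+{years_ahead} years: ~{horizon:.0f} hours")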