Adam Shai (@adamimos)
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
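For readers who want to see the mechanics, here's a minimal, hypothetical sketch of the kind of activation steering the tweet describes: extract a difference-of-means direction from a base model's residual stream, then add it to a distilled reasoning model's activations via a forward hook. This assumes HuggingFace-style models with a Llama-style layer layout; the model IDs, layer, coefficient, and prompt sets are invented placeholders, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "base-model-id"            # hypothetical placeholders,
TARGET_ID = "distilled-reasoner-id"  # not the paper's actual models

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID)

LAYER = 12   # illustrative residual-stream layer
COEFF = 4.0  # steering strength, a free knob

# Contrastive prompt sets: contexts that do vs. don't precede backtracking
# (invented for illustration).
pos_prompts = ["... wait, that can't be right, let me reconsider:"]
neg_prompts = ["... so the answer follows directly:"]

@torch.no_grad()
def mean_last_token_resid(model, prompts, layer):
    """Mean residual-stream activation at `layer` over each prompt's last token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states
        acts.append(hs[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Difference-of-means steering vector, extracted from the BASE model.
steer = (mean_last_token_resid(base, pos_prompts, LAYER)
         - mean_last_token_resid(base, neg_prompts, LAYER))

def add_steering(module, inputs, output):
    # Llama-style decoder layers return a tuple; hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Apply the base-model vector inside the TARGET (distilled) model.
handle = target.model.layers[LAYER].register_forward_hook(add_steering)
out = target.generate(**tok("Solve: 17 * 24 = ?", return_tensors="pt"),
                      max_new_tokens=64)
print(tok.decode(out[0]))
handle.remove()
```

The surprising claim is that this same vector, hooked into the base model instead, does nothing apparent, which is what suggests the backtracking behavior isn't learned from scratch during distillation.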
What are the most beautiful research blogs presenting technical work? I'm a big fan of how Anthropic presents their transformer circuits work. Interested in others.
Remember when we were all talking about mech interp explanations for why transformers were bad at negation?
A few months ago I resigned from my tenured position at the University of Melbourne and joined Timaeus as Director of Research. Timaeus is an AI safety non-profit research organisation. [1/n]🧵
Today, the most competent AI systems in almost *any* domain (math, coding, etc.) are broadly knowledgeable across almost *every* domain. Does it have to be this way, or can we create truly narrow AI systems? In a new preprint, we explore some questions relevant to this goal...
New preprint alert! We often prompt ICL tasks using either demonstrations or instructions. How much does the form of the prompt matter to the task representation formed by a language model? Stick around to find out! 1/N
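Not the paper's setup, but a minimal sketch of the contrast the tweet describes: the same antonym task posed via demonstrations vs. an explicit instruction, comparing last-token hidden states as a crude proxy for the task representation. The model ID, layer, and prompts are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-causal-lm"  # hypothetical placeholder
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
LAYER = 10  # illustrative choice of layer

# Two prompt forms specifying the same task (antonyms).
demo_prompt = "hot -> cold\ntall -> short\nfast ->"
instr_prompt = "Output the antonym of the given word.\nfast ->"

@torch.no_grad()
def task_vec(prompt):
    """Hidden state at the final token, taken as a proxy task representation."""
    ids = tok(prompt, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

sim = torch.nn.functional.cosine_similarity(
    task_vec(demo_prompt), task_vec(instr_prompt), dim=0
)
print(f"cosine similarity of task representations: {sim.item():.3f}")
```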
Our manuscript “AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability” arxiv.org/abs/2504.04608 has been accepted for the RL Conference! rl-conference.cc/index.html 🧵👇🏽
Without further advances in alignment, we risk optimizing for what we can easily measure (user engagement, unit tests passing, dollars earned) at the expense of what we actually care about.
Great post, "So you want to work in mechanistic interpretability", on skills to develop and resources to use, whether you're coming more from research or from engineering. (link in thread)
Man goes to doctor. "Doctor, I'm worried AGI will kill us all." "Don't worry," says doctor, "they wouldn't build it if they thought it might kill everyone." The man breaks down, sobbing. "But doctor, I *am* building AGI..."
Three more days to apply to work with us to build and apply a first-principles science of interpretability and intelligence!
Apply to work with me and Paul Riechers to build a science of AI interpretability. Help us extend our work predicting and finding fractals in the minds of transformers! MATS is one of the best ways to get into technical AI Safety!