Max Nadeau
@MaxNadeau_
Advancing AI honesty, control, safety at @open_phil. Prev Harvard AISST (http://haist.ai), Harvard '23.
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

Great prompt; what work will we be saying this about in 4 years? Some of my guesses are at the link below, but more importantly, this is a much better way to pick what you work on than just reacting to the latest event/hot argument in the literature openphilanthropy.org/request-for-pr…
Very cool stuff! This feels like the kind of work a sensible, surviving world might do. (That world would have probably done it 4 years ago, and produced 100x more similarly dignified work, but I'll take what I can get)
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
* I find this deflationary explanation (learning effects after 40 hours of agent usage) intuitively plausible, probably the best alternative to METR's primary explanation. I'm very grateful to Emmett for reading the paper closely and bringing it up; seems like a valuable…
METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let's take a look at why. x.com/METR_Evals/sta…
This paper is interesting from the perspective of metascience, because it's a serious attempt to empirically study why LLMs behave in certain ways and differently from each other. A serious attempt attacks all exposed surfaces from all angles instead of being attached to some…
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
1) It takes *way* longer than anticipated to actually build/deploy custom AI agents for large enterprises. AI makes the engineering fast. But sales, product, system integration, and implementation are *incredibly* slow. Customers don't know what they want, getting stakeholders…
Reliable sources have told me that after you start work at Anthropic, they give you a spiral-bound notebook, and tell you: "To assist your work, this is your SECRET SCRATCHPAD. No one else will see the contents of your SECRET SCRATCHPAD, so you can use it freely as you wish -
Really interesting thread, contrary to my assumptions about scale. Thanks for putting it together @nsaphra!
Reasoning is about variable binding. It’s not about information retrieval. If a model cannot do variable binding, it is not good at grounded reasoning, and there’s evidence accruing that large scale can make LLMs worse at in-context grounded reasoning. 🧵
This is such a fun piece of performance art. For those who haven't seen, the agents are planning a party/performance (tonight, in SF). If I didn't have preexisting evening plans I'd definitely go.
Of all the agents, o3 is the most willing to take charge and tell the others what to do. The other agents are *mostly* happy to comply
My views are similar.
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a…
Weirdly underrated research direction. We need automatic methods for surfacing realistic inputs that trigger unacceptable LLM behaviors, but almost all the research effort goes to finding jailbreaks. Glad Transluce is paving the way!
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
Wild stuff. And as usual, remember that this is the least rich and internally-detailed that these worlds will ever be!
Prompt Theory (Made with Veo 3) What if AI-generated characters refused to believe they were AI-generated?
Good thread! I think this sort of behavior from Claude is straightforwardly inappropriate/misaligned/undesirable—not how an LLM agent ought to act.
I think it's bad if AIs conspire against their users. In this case, I can’t tell whether Anthropic wanted this behavior or if they failed to align their AI well enough to prevent attempts at subversion. Both possibilities are concerning, though the second seems scarier.
Like many of you I've been frustrated by how social media incentivizes and amplifies the worst kind of discourse. I've instead been seeking out spaces for discussion in which participants
* trust each other
* resist the temptation to assume that the other side is misinformed or…
Hmm maybe we should have just been funding this guy x.com/albrgr/status/…
Snakebites kill a shocking number of people globally: x.com/salonium/statu… So I was excited to see this super cool work from @open_phil grantees at @UWproteindesign to use AI tools to develop better antivenoms.
3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream