Marie Davidsen Buhl
@MarieBassBuhl
Research Scientist @AISecurityInst | AI Policy Researcher @GovAI_ | Frontier AI Safety Cases
Do you know cognitive scientists / folks who run behavioural experiments with human participants? Refer them to my team!
We're hiring a cognitive scientist to join the AISI Alignment Team! Cognitive science is a crucial field that we want to galvanise to help solve one of the most important problems of our time. Could you lead that effort?
More safety case sketches!
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
Come work with me!! I'm hiring a research manager for @AISecurityInst's Alignment Team. You'll manage exceptional researchers tackling one of humanity’s biggest challenges. Our mission: ensure we have ways to make superhuman AI safe before it poses critical risks. 1/4
New work from my colleagues! We want AIs to do open-ended research tasks with no single right answer. How do we make sure AIs don't use that freedom to subtly mislead or cause harm? The proposal: Check that the answers are random along relevant dimensions. V cool work!
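A minimal sketch of what "random along relevant dimensions" could look like in practice (my own toy example and function names, not the authors' method): collect the model's choices on a dimension where any option is equally acceptable, then test for systematic skew.

```python
# Toy sketch, not the authors' method: if a model is free to choose among k
# equally-acceptable options, its choices along that free dimension should look
# roughly random; a strong skew suggests the freedom is being used to steer outcomes.
from collections import Counter
from scipy.stats import chisquare

def looks_random(choices, k, alpha=0.01):
    """Chi-squared goodness-of-fit: are choices over k equally-good options ~uniform?"""
    counts = [Counter(choices).get(i, 0) for i in range(k)]
    _, p_value = chisquare(counts)  # default expectation is uniform
    return p_value > alpha  # False = suspiciously non-random along this dimension

# A model that always picks option 0 when any option would do gets flagged:
print(looks_random([0] * 100, k=4))            # False
print(looks_random(list(range(4)) * 25, k=4))  # True
```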
New work with @geoffreyirving: "Unexploitable search: blocking malicious use of free parameters". We formalize how a misaligned AI can exploit underspecified objectives to optimize for hidden goals, and propose zero-sum games as a solution. [1/3]
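Rough formal picture of the problem and the zero-sum idea, in my own notation (not necessarily the paper's formalisation):

```latex
% My notation, not necessarily the paper's. For a task x, let S(x) be the set of
% outputs that all satisfy the stated objective; the choice within S(x) is a free
% parameter. A misaligned model with hidden utility u can exploit that slack
% without ever violating the specification:
\[
  y^{*}(x) \;=\; \arg\max_{y \in S(x)} u(y).
\]
% The zero-sum idea: make the choice within S(x) the outcome of a zero-sum game,
% so any systematic push toward a hidden goal by one player is a payoff the other
% player is incentivised to cancel out:
\[
  \max_{\pi_1} \, \min_{\pi_2} \; \mathbb{E}_{y \sim (\pi_1, \pi_2)}\!\left[ r(x, y) \right],
  \qquad \text{with player 2 receiving } -r(x, y).
\]
```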
More work from my team on alignment safety cases!
Humans are often very wrong. This is a big problem if you want to use human judgment to oversee super-smart AI systems. In our new post, @geoffreyirving argues that we might be able to deal with this issue – not by fixing the humans, but by redesigning oversight protocols.
Want to build an aligned ASI? Our new paper explains how to do that, using debate. Tl;dr: Debate + exploration guarantees + no obfuscated arguments + good human input = outer alignment. Outer alignment + online training = inner alignment*. (*Sufficient for low-stakes contexts.)
We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵
Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper 🧵 1/
When scalable oversight techniques like debate show empirical success, what additional evidence will we need to ensure the resulting models are aligned? We work through the details of what we need to assume about deployment context, training data, training dynamics, and…