Xander Davies
@alxndrdavies
technical staff @AISecurityInst | PhD student w @yaringal at @OATML_Oxford | prev @Harvard (http://haist.ai)
My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4

My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️
We’re encouraged to see AISI’s safeguarding work recognised. As capabilities advance, it’s increasingly important to invest in testing and strengthening these protections.
We're excited to continue testing Agent’s safeguards, and to keep informing govt while providing useful investigations to OAI and others. As usual, great to work with Eric Winsor @_robertkirk @giglema @AlexandraSouly 4/4
Update from @alxndrdavies on the @AISecurityInst's work on improving agent safeguards with @OpenAI
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
Exciting role on an important team!
We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI’s evaluations. If you're sharp, methodologically rigorous, and want shape research and policy, this role might be for you! 🧵
New work out: We demonstrate a new attack against stacked safeguards and analyse defence in depth strategies. Excited for this joint collab between @farairesearch and @AISecurityInst to be out!
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
Excited about this work from @farairesearch x @AISecurityInst !
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
New paper: real attackers don't jailbreak. Instead, they often use open-weight LLMs. For harder misuse tasks, they can use "decomposition attacks," where a misuse task is split into benign queries across new sessions. These answers help an unsafe model via in-context learning.
📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.
New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
Hawthorne effect describes how study participants modify their behavior if they know they are being observed In our paper 📢, we study if LLMs exhibit analogous patterns🧠 Spoiler: they do⚠️ 🧵1/n
Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical mitigations, on top of our earlier sketch for inability arguments related to evals. :)
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
Check out our new paper connecting safeguard evals to more end-to-end safety arguments!
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper🧵1/
1/ The @AISecurityInst research agenda is out - some highlights: AISI isn’t just asking what could go wrong with powerful AI systems. It’s focused on building the tools to get it right. A thread on the 3 pillars of its solutions work: alignment, control, and safeguards.