Xander Davies

@alxndrdavies

technical staff @AISecurityInst | PhD student w @yaringal at @OATML_Oxford | prev @Harvard (http://haist.ai)

London

Joined March 2020

676Following

2KFollowers

Pinned

Xander Davies@alxndrdavies · Mar 12

My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4

alxndrdavies's tweet image. My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research &amp; testing. 🧵 1/4

174

47.0K

Xander Davies Retweeted

Hannah Rose Kirk@hannahrosekirk · Jul 22

My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️

9.0K

Xander Davies@alxndrdavies · Jul 18

We’re encouraged to see AISI’s safeguarding work recognised. As capabilities advance, it’s increasingly important to invest in testing and strengthening these protections.

XXander Davies@alxndrdavies · Jul 17

We're excited to continue testing Agent’s safeguards, and to keep informing govt while providing useful investigations to OAI and others. As usual, great to work with Eric Winsor @_robertkirk @giglema @AlexandraSouly 4/4

2.0K

Xander Davies@alxndrdavies · Jul 17

Update from @alxndrdavies on the @AISecurityInst's work on improving agent safeguards with @OpenAI

XXander Davies@alxndrdavies · Jul 17

We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4

2.0K

Xander Davies@alxndrdavies · Jul 15

Exciting role on an important team!

CCozmin Ududec@CUdudec · Jul 13

We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI’s evaluations. If you're sharp, methodologically rigorous, and want shape research and policy, this role might be for you! 🧵

325

Xander Davies@alxndrdavies · Jul 2

New work out: We demonstrate a new attack against stacked safeguards and analyse defence in depth strategies. Excited for this joint collab between @farairesearch and @AISecurityInst to be out!

FFAR.AI@farairesearch · Jul 2

1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.

2.0K

Xander Davies@alxndrdavies · Jul 2

Excited about this work from @farairesearch x @AISecurityInst !

FFAR.AI@farairesearch · Jul 2

262

Xander Davies Retweeted

Davis Brown@davisbrownr · Jun 23

New paper: real attackers don't jailbreak. Instead, they often use open-weight LLMs. For harder misuse tasks, they can use "decomposition attacks," where a misuse task is split into benign queries across new sessions. These answers help an unsafe model via in-context learning.

2.0K

Xander Davies@alxndrdavies · Jun 16

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

AAnthropic@AnthropicAI · Jun 16

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

10.0K

Xander Davies Retweeted

Sahar Abdelnabi 🕊 (on 🦋)@sahar_abdelnabi · Jun 1

Hawthorne effect describes how study participants modify their behavior if they know they are being observed In our paper 📢, we study if LLMs exhibit analogous patterns🧠 Spoiler: they do⚠️ 🧵1/n

112

13.0K

Xander Davies@alxndrdavies · May 30

Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical mitigations, on top of our earlier sketch for inability arguments related to evals. :)

RRobert Kirk@_robertkirk · May 30

New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵

3.0K

Xander Davies@alxndrdavies · May 30

Check out our new paper connecting safeguard evals to more end-to-end safety arguments!

RRobert Kirk@_robertkirk · May 30

716

Xander Davies Retweeted

Marie Davidsen Buhl@MarieBassBuhl · May 8

Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper🧵1/

10.0K

Xander Davies Retweeted

Ian Hogarth@soundboy · May 8

1/ The @AISecurityInst research agenda is out - some highlights: AISI isn’t just asking what could go wrong with powerful AI systems. It’s focused on building the tools to get it right. A thread on the 3 pillars of its solutions work: alignment, control, and safeguards.

2.0K