Robert Kirk
@_robertkirk
Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
New work out: We demonstrate a new attack against stacked safeguards and analyse defence-in-depth strategies. Excited for this collaboration between @farairesearch and @AISecurityInst to be out!
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉 TLDR: Best-of-N was shown to exhibit power-law (polynomial) scaling (left), but the maths suggests one should expect exponential scaling (center). We show how to…
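A one-line gloss of the "maths suggests exponential" intuition (my own sketch, assuming each of the N samples independently solves the task with probability p and that a perfect verifier picks any success):

\[
\Pr[\text{Best-of-}N \text{ fails}] \;=\; (1-p)^N \;=\; e^{-cN}, \qquad c = -\ln(1-p) > 0,
\]

so the failure rate should fall exponentially in N, in contrast to an empirically fitted power law of the form N^{-a}.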
Ever tried to tell if someone really forgot your birthday? ... evaluating forgetting is tricky. Now imagine doing that… but for an LLM… with privacy on the line. We studied how to evaluate machine unlearning, and we found some problems. 🧵
The Hawthorne effect describes how study participants modify their behavior if they know they are being observed. In our paper 📢, we study whether LLMs exhibit analogous patterns 🧠 Spoiler: they do ⚠️ 🧵1/n
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
Now all three of AISI's Safeguards, Control, and Alignment teams each have a paper sketching safety cases for technical mitigations, on top of our earlier sketch of inability arguments related to evals. :)
We’ve written a safety case for safeguards against misuse, including a methodology for connecting the results of safeguard evaluations to risk estimates🛡️ This helps make safeguard evaluations actionable, which is increasingly important as AI systems increase in capability.
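Purely as an illustration of what "connecting evaluation results to risk estimates" can look like (not the paper's actual methodology; every number and name below is made up):

```python
# Illustrative only: a rough way to turn red-team safeguard evaluation
# results into a back-of-the-envelope risk estimate. All figures are
# hypothetical.

bypass_successes = 3       # red-team attempts that got past the safeguards
bypass_attempts = 200      # total red-team attempts in the evaluation
p_bypass = bypass_successes / bypass_attempts

misuse_attempts_per_year = 1_000   # assumed real-world attempt rate
expected_successful_misuse = misuse_attempts_per_year * p_bypass

print(f"Estimated bypass rate: {p_bypass:.1%}")
print(f"Expected successful misuse attempts per year: {expected_successful_misuse:.0f}")
```

A real safety case would of course treat the attempt rate, uncertainty, and severity of harm far more carefully than this two-line calculation.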
We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵
Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper🧵1/