Robert Kirk
@_robertkirk
Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
New work out: We demonstrate a new attack against stacked safeguards and analyse defence-in-depth strategies. Excited for this collaboration between @farairesearch and @AISecurityInst to be out!
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉 TLDR: Best-of-N was shown to exhibit power-law (polynomial) scaling (left), but the maths suggests one should expect exponential scaling (center). We show how to…
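A one-line gloss of the "maths suggests exponential" intuition (my own sketch, assuming each of the N samples independently solves the task with probability p and that a perfect verifier picks any success):

\[
\Pr[\text{Best-of-}N \text{ fails}] \;=\; (1-p)^N \;=\; e^{-cN}, \qquad c = -\ln(1-p) > 0,
\]

so the failure rate should fall exponentially in N, in contrast to an empirically fitted power law of the form N^{-a}.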
Ever tried to tell if someone really forgot your birthday? ... evaluating forgetting is tricky. Now imagine doing that… but for an LLM… with privacy on the line. We studied how to evaluate machine unlearning, and we found some problems. 🧵
The Hawthorne effect describes how study participants modify their behavior if they know they are being observed. In our paper 📢, we study whether LLMs exhibit analogous patterns 🧠 Spoiler: they do ⚠️ 🧵1/n
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
Now all three of AISI's Safeguards, Control, and Alignment teams each have a paper sketching safety cases for technical mitigations, on top of our earlier sketch of inability arguments related to evals. :)
We’ve written a safety case for safeguards against misuse, including a methodology for connecting the results of safeguard evaluations to risk estimates🛡️ This helps make safeguard evaluations actionable, which is increasingly important as AI systems increase in capability.
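Purely as an illustration of what "connecting evaluation results to risk estimates" can look like (not the paper's actual methodology; every number and name below is made up):

```python
# Illustrative only: a rough way to turn red-team safeguard evaluation
# results into a back-of-the-envelope risk estimate. All figures are
# hypothetical.

bypass_successes = 3       # red-team attempts that got past the safeguards
bypass_attempts = 200      # total red-team attempts in the evaluation
p_bypass = bypass_successes / bypass_attempts

misuse_attempts_per_year = 1_000   # assumed real-world attempt rate
expected_successful_misuse = misuse_attempts_per_year * p_bypass

print(f"Estimated bypass rate: {p_bypass:.1%}")
print(f"Expected successful misuse attempts per year: {expected_successful_misuse:.0f}")
```

A real safety case would of course treat the attempt rate, uncertainty, and severity of harm far more carefully than this two-line calculation.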
We wrote out a very speculative safety case sketch for low-stakes alignment, based on safe-but-intractable computations using humans, scalable oversight, and learning theory + exploration guarantees. It does not work yet; the goal is to find and clarify alignment subproblems. 🧵
Can we massively scale up AI alignment research by identifying subproblems many people can work on in parallel? UK AISI’s alignment team is trying to do that. We’re starting with AI safety via debate - and we’ve just released our first paper🧵1/