Mikita Balesni 🇺🇦

@balesni

Working on risks from rogue AI @apolloaievals Past: Reversal curse, Out-of-context reasoning // best way to support 🇺🇦 https://savelife.in.ua/en/donate-en/

London

Joined June 2013

616Following

752Followers

Pinned

Mikita Balesni 🇺🇦@balesni · Apr 14

AI control evaluations have assumed the red team to have almost "no holds barred", making these evals overly conservative and costly for near-future models. In a new paper, we propose changes to control evals informed by model capabilities to subvert control measures.

TTomek Korbak@tomekkorbak · Apr 14

LLM agents might cause serious harm if they start pursuing misaligned goals. In our new paper, we show how to use capability evals in helping determine which control measures (e.g. monitoring) are sufficient to ensure that an agent can be deployed safely.

1.0K

Mikita Balesni 🇺🇦@balesni · Jul 16

I'm not much of a model internals interpretability hater, and am excited for more people to work on it (both near-term applications like detecting when models are evaluation-aware, and ambitious interp / decompiling models). But reading CoT is so much more useful right now!

TTomek Korbak@tomekkorbak · Jul 16

702

Mikita Balesni 🇺🇦 Retweeted

Wojciech Zaremba@woj_zaremba · Jul 15

When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed…

196

19.0K

Mikita Balesni 🇺🇦 Retweeted

Bowen Baker@bobabowen · Jul 15

I am grateful to have worked closely with @tomekkorbak, @balesni, @rohinmshah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.

3.0K

Mikita Balesni 🇺🇦@balesni · Jul 15

The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

8.0K

Mikita Balesni 🇺🇦 Retweeted

Tomek Korbak@tomekkorbak · Jun 5

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀

170

33.0K

Mikita Balesni 🇺🇦@balesni · May 4

The initial explainer post was very weak IMO; but I am very glad to see this expansion

AAidan McLaughlin@aidan_mclau · May 2

i strongly encourage everyone to read this blog post very detailed explanation of our posttraining, process, and what we’re changing to do better link below

337

Mikita Balesni 🇺🇦@balesni · Apr 14

If you want to do AI safety work but lack funding, here’s a great opportunity to get it:

MMax Nadeau@MaxNadeau_ · Apr 14

Tomorrow is the last day to apply to Open Phil for grant money for an AI/ML project in interpretability, alignment and all the other areas of AI safety in our RFP (see next post). We are very open to applications from non-professors who wish they could run bigger experiments!

247

Mikita Balesni 🇺🇦 Retweeted

Tomek Korbak@tomekkorbak · Apr 10

I'm bad at context switching and, when delegating to @cursor_ai agent, often find myself forgetting to check if it's done and not remembering what exactly I asked for. So a wrote a small MCP server to fix that.

499

Mikita Balesni 🇺🇦@balesni · Mar 17

Sonnet 3.7 often realises it's being eval'd on alignment e.g. sandbagging. It starts to give misaligned, self-preserving answers but then "Actually, let me reconsider" and gives "correct"/"aligned" answers *because it thinks it's being tested*

AApollo Research@apolloaievals · Mar 17

AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment. Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test:

108

6.0K