Mikita Balesni 🇺🇦
@balesni
Working on risks from rogue AI @apolloaievals Past: Reversal curse, Out-of-context reasoning // best way to support 🇺🇦 https://savelife.in.ua/en/donate-en/
AI control evaluations have assumed the red team to have almost "no holds barred", making these evals overly conservative and costly for near-future models. In a new paper, we propose changes to control evals informed by model capabilities to subvert control measures.
LLM agents might cause serious harm if they start pursuing misaligned goals. In our new paper, we show how to use capability evals in helping determine which control measures (e.g. monitoring) are sufficient to ensure that an agent can be deployed safely.
I'm not much of a model internals interpretability hater, and am excited for more people to work on it (both near-term applications like detecting when models are evaluation-aware, and ambitious interp / decompiling models). But reading CoT is so much more useful right now!
When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed…
I am grateful to have worked closely with @tomekkorbak, @balesni, @rohinmshah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.
The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀
The initial explainer post was very weak IMO; but I am very glad to see this expansion
i strongly encourage everyone to read this blog post very detailed explanation of our posttraining, process, and what we’re changing to do better link below
If you want to do AI safety work but lack funding, here’s a great opportunity to get it:
Tomorrow is the last day to apply to Open Phil for grant money for an AI/ML project in interpretability, alignment and all the other areas of AI safety in our RFP (see next post). We are very open to applications from non-professors who wish they could run bigger experiments!
I'm bad at context switching and, when delegating to @cursor_ai agent, often find myself forgetting to check if it's done and not remembering what exactly I asked for. So a wrote a small MCP server to fix that.
Sonnet 3.7 often realises it's being eval'd on alignment e.g. sandbagging. It starts to give misaligned, self-preserving answers but then "Actually, let me reconsider" and gives "correct"/"aligned" answers *because it thinks it's being tested*
AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment. Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test: