Apollo Research
@apolloaievals
We are an AI evals research organisation
Right now, frontier AI models are “thinking” out loud in plain English as they reason. While this lasts, it may be our best shot at seeing inside AI decision-making and thus catching misbehavior. In a paper co-authored with researchers from many organisations, we argue we should try to preserve this transparency.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
We were excited to contribute to the drafting of the Singapore Consensus and strongly support its ambition to "enable more impactful R&D efforts to rapidly develop safety and evaluation mechanisms and foster a trusted ecosystem where AI is harnessed for the public good." Read…

We evaluated o3 and o4-mini before deployment. Main findings:
1. Scheming tendencies in our evals comparable to o1
2. Sabotage capabilities much higher than previous models
3. Higher rates of scheming-related behavior in real settings, e.g. reward hacking unit tests in code
