Fabien Roger
@FabienDRoger
AI Safety Researcher
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.
We’re taking applications for collaborators via @MATSprogram! Apply by April 18, 11:59 PT to collaborate with various mentors from AI safety research groups: matsprogram.org/apply#Perez 🧵
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
🧵NEW RESEARCH: Interested in whether R1 or GPT 4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment fakes? Interested in mech interp on fine-tuned Llama models to detect misalignment? If so, check out our blog! 👀lesswrong.com/posts/Fr4QsQT5…
DeepSeek R1 verbalizes the hint ≥9% of the time in 6/6 settings! (vs 4/6 for v3) I think this is a good sign for using CoT monitoring to detect misalignment at train and eval time, where catching misaligned reasoning or reward hacking in 9% of CoTs would still be very useful!
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.