Peter Hase
@peterbhase
Visiting Scientist at Schmidt Sciences. Visiting Researcher at the Stanford NLP Group Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill
My last PhD paper 🎉: fundamental problems with model editing for LLMs! We present *12 open challenges* with definitions/benchmarks/assumptions, inspired by work on belief revision in philosophy To provide a way forward, we test model editing against Bayesian belief revision 🧵

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️
Current agents are highly unsafe, o3-mini one of the most advanced models in reasoning score 71% in executing harmful requests 😱 We introduce a new framework for evaluating agent safety✨🦺 Discover more 👇 👩💻 Code & data: github.com/Open-Agent-Saf… 📄 Paper:…
1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings? 🚨 NEW: OpenAgentSafety - A comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories. 🧵
Overdue job update -- I am now: - A Visiting Scientist at @schmidtsciences, supporting AI safety and interpretability - A Visiting Researcher at the Stanford NLP Group, working with @ChrisGPotts I am so grateful I get to keep working in this fascinating and essential area, and…
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
really interesting to see just how gendered excitement about AI is, even among AI experts
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.
🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”? ❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
New AI/LLM Agents Track at #EMNLP2025! In the past few years, it feels a bit odd to submit agent work to *CL venues because one had to awkwardedly fit it into Question Answering or NLP Applications. Glad to see agent research finally finds home at *CL! Kudos to the PC for…
🚨 Introducing our @TmlrOrg paper “Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation” W:nt UnLOK-VQA, a benchmark to evaluate unlearning in vision-and-language models—where both images and text may encode sensitive or private…
Extremely excited to announce that I will be joining @UTAustin @UTCompSci in August 2025 as an Assistant Professor! 🎉 I’m looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD…
New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
🔆Announcement time!🔆In Spring 2026, I will be joining the NYU Psych department as an Assistant Professor! My lab will study the computational cognitive science of moral judgment and how we can use that knowledge to build AI systems that are safe and aligned with human values.
Excited to present our recent work on AI safety at this event! lmxsafety.com If you're coming to ICLR 2025 in Singapore and interested in AI safety, you should stop by :-)
My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏…
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
GDM Mech Interp Update: We study if SAEs help probes generalise OOD (they don't 😢). Based on this + parallel negative results on real-world tasks, we're de-prioritising SAE work. Our guess is that SAEs aren't useless, but also aren't a game-changer More + new research in 🧵