Peter Hase

@peterbhase

Visiting Scientist at Schmidt Sciences. Visiting Researcher at the Stanford NLP Group Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill

New York, NY

Joined April 2019

995Following

3KFollowers

Pinned

Peter Hase@peterbhase · Jun 28, 2024

My last PhD paper 🎉: fundamental problems with model editing for LLMs! We present *12 open challenges* with definitions/benchmarks/assumptions, inspired by work on belief revision in philosophy To provide a way forward, we test model editing against Bayesian belief revision 🧵

peterbhase's tweet image. My last PhD paper 🎉: fundamental problems with model editing for LLMs!

We present *12 open challenges* with definitions/benchmarks/assumptions, inspired by work on belief revision in philosophy

To provide a way forward, we test model editing against Bayesian belief revision
🧵

306

190

54.0K

Pinned

Peter Hase Retweeted

Miles Turpin@milesaturpin · Jul 14

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

277

136

22.0K

Peter Hase Retweeted

Hannah Rose Kirk@hannahrosekirk · Jul 22

My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️

108

10.0K

Peter Hase@peterbhase · Jul 15

Current agents are highly unsafe, o3-mini one of the most advanced models in reasoning score 71% in executing harmful requests 😱 We introduce a new framework for evaluating agent safety✨🦺 Discover more 👇 👩‍💻 Code & data: github.com/Open-Agent-Saf… 📄 Paper:…

SSanidhya Vijayvargiya@sanidhya903 · Jul 15

1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings? 🚨 NEW: OpenAgentSafety - A comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories. 🧵

10.0K

Peter Hase@peterbhase · Jul 14

Overdue job update -- I am now: - A Visiting Scientist at @schmidtsciences, supporting AI safety and interpretability - A Visiting Researcher at the Stanford NLP Group, working with @ChrisGPotts I am so grateful I get to keep working in this fascinating and essential area, and…

173

15.0K

Peter Hase Retweeted

Fazl Barez @ICML2025@FazlBarez · Jul 1

Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵

133

635

456

110.0K

Peter Hase Retweeted

Michel@JustenMichel · Jun 26

really interesting to see just how gendered excitement about AI is, even among AI experts

248

63.0K

Peter Hase Retweeted

FAR.AI@farairesearch · Jun 5

🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!

10.0K

Peter Hase Retweeted

Jiaxin Wen@jiaxinwen22 · Jun 11

New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

154

1.0K

223.0K

Peter Hase Retweeted

Dongkeun Yoon@dongkeun_yoon · May 21

🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”? ❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.

299

266

31.0K

Peter Hase Retweeted

Yu Su@ysu_nlp · May 8

New AI/LLM Agents Track at #EMNLP2025! In the past few years, it feels a bit odd to submit agent work to *CL venues because one had to awkwardedly fit it into Question Answering or NLP Applications. Glad to see agent research finally finds home at *CL! Kudos to the PC for…

183

40.0K

Peter Hase Retweeted

Vaidehi Patil@vaidehi_patil_ · May 7

🚨 Introducing our @TmlrOrg paper “Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation” W:nt UnLOK-VQA, a benchmark to evaluate unlearning in vision-and-language models—where both images and text may encode sensitive or private…

101

10.0K

Peter Hase Retweeted

Elias Stengel-Eskin@EliasEskin · May 5

Extremely excited to announce that I will be joining @UTAustin @UTCompSci in August 2025 as an Assistant Professor! 🎉 I’m looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD…

448

48.0K

Peter Hase Retweeted

rowan@rowankwang · Apr 24

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

350

282

72.0K

Peter Hase Retweeted

Aaron Mueller@amuuueller · Apr 23

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!

171

28.0K

Peter Hase Retweeted

Sydney Levine@sydneymlevine · Apr 7

🔆Announcement time!🔆In Spring 2026, I will be joining the NYU Psych department as an Assistant Professor! My lab will study the computational cognitive science of moral judgment and how we can use that knowledge to build AI systems that are safe and aligned with human values.

266

21.0K

Peter Hase Retweeted

Maksym Andriushchenko@maksym_andr · Apr 5

Excited to present our recent work on AI safety at this event! lmxsafety.com If you're coming to ICLR 2025 in Singapore and interested in AI safety, you should stop by :-)

104

6.0K

Peter Hase@peterbhase · Apr 3

My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏…

AAnthropic@AnthropicAI · Apr 3

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

1.0K

365

86.0K

Peter Hase Retweeted

Neel Nanda@NeelNanda5 · Mar 26

GDM Mech Interp Update: We study if SAEs help probes generalise OOD (they don't 😢). Based on this + parallel negative results on real-world tasks, we're de-prioritising SAE work. Our guess is that SAEs aren't useless, but also aren't a game-changer More + new research in 🧵

827

482

252.0K