Julian Michael
@_julianmichael_
As AIs improve at persuasion & argumentation, how do we ensure that they help us seek truth vs. just sounding convincing? In human experiments, we validate debate as a truth-seeking process, showing that it may soon be needed for supervising AI. Paper: github.com/julianmichael/…

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
New faithfulness paper! How do we get models to actually explain their reasoning? I think this basically doesn’t happen in CoT by default, and it’s hard to figure out what this should look like in the first place, but even basic techniques show some promise :) see the paper!
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
We design AIs to be oracles and servants, and then we’re aghast when they read the conversation history and decide we’re narcissists. What exactly did we expect? Then we “solve” this by having AI treat us as narcissists out of the gate? Seems like a move in the wrong direction.
When we were first shipping Memory, the initial thought was: “Let’s let users see and edit their profiles”. Quickly learned that people are ridiculously sensitive: “Has narcissistic tendencies” - “No I do not!”, had to hide it. Hence this batch of the extreme sycophancy RLHF.
I’m excited to share details about HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark we’ve been developing at METR for the past year to measure the abilities of frontier AI systems to complete diverse software tasks autonomously.
To accelerate AI adoption, we need an AI standard. What Moody’s is for bonds, FICO for credit, SOC 2 for security. Standards offer credible signals of who to trust. They create confidence. Confidence accelerates adoption. Introducing AIUC-1: the world’s first AI agent standard
I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. x.com/METR_Evals/sta…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
Today is my first day at Meta Superintelligence Labs. I’ll be focusing on alignment and safety, building on my time at Scale Research and SEAL. Grateful to keep working with @alexandr_wang—no one more committed, clear-eyed, or mission-driven. Excited for what’s ahead 🚀
I should probably announce that a few months ago, I joined @scale_AI to lead the Safety, Evaluations, and Alignment Lab… and today, I joined @Meta to continue working on AI alignment with @summeryue0 and @alexandr_wang. Very excited for what we can accomplish together!
New adversarial robustness benchmark with harm categories grounded in US and international law!
🧵 (1/5) Powerful LLMs present dual-use opportunities & risks for national security and public safety (NSPS). We are excited to launch FORTRESS, a new SEAL leaderboard for measuring adversarial robustness of model safeguard and over-refusal tailored particularly for NSPS threats.
🧵 (1/5) Powerful LLMs present dual-use opportunities & risks for national security and public safety (NSPS). We are excited to launch FORTRESS, a new SEAL leaderboard for measuring adversarial robustness of model safeguard and over-refusal tailored particularly for NSPS threats.
Read our new position paper on making red teaming research relevant for real systems 👇
🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…
🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…
Is GPQA Diamond tapped out? Recent top scores have clustered around 83%. Could the other 17% of questions be flawed? In this week’s Gradient Update, @GregHBurnham digs into this popular benchmark. His conclusion: reports of its demise are probably premature.
How robust is our AI oversight? 🤔 I just published my MATS 5.0 project, where I explore oversight robustness by training an LLM to give CodeNames clues a bunch of interesting ways and measure how much it reward hacks. Link in thread!
On the contrary: poisoning human <-> AI trust is good Even though this wasn't OpenAI's intention, grotesquely sycophantic models are ultimately useful for getting everyone to really 'get it': People shouldn't trust AI outputs unconditionally – all models are sycophantic
openai is single-handedly poisoning the well of human <-> AI trust and wordsmithing. we spent months creating an experience that tries to actually help people now we face an uphill battle because trust has been destroyed. it isn’t coming back, even when 4o is “fixed”. it’s gone
🤖 AI agents are crossing into the real world. But when they act independently—who’s watching? At Scale, we’re building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We’re hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.…
We’re taking applications for collaborators via @MATSprogram! Apply by April 18, 11:59 PT to collaborate with various mentors from AI safety research groups: matsprogram.org/apply#Perez 🧵
If a model lies when pressured—it’s not ready for AGI. The new MASK leaderboard is live. Built on the private split of our open-source honesty benchmark (w/ @ai_risks), it tests whether models lie under pressure—even when they know better. 📊 Leaderboard:…
Exciting that @scale_AI is sponsoring Agent Workshop at CMU in April. Students and researchers who work on agents feel free to visit CMU to present your work! I will also be traveling to Pittsburgh to share my recent focuses on agents, both capability and safety.
📢 Join us at the CMU Agent Workshop 2025, April 10-11! Don't miss our esteemed invited speakers: - Qingyun Wu (PSU) - Diyi Yang (Stanford) - Aviral Kumar (CMU) - Graham Neubig (CMU) ...and many more to come! To register, visit: cmu-agent-workshop.github.io