John (Yueh-Han) Chen
@jcyhc_ai
AI Safety Research @nyuniversity | MATS Scholar | Prev @UCBerkeley
Do LLMs show systematic generalization of safety facts to novel scenarios? Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this! - Claude-3.7-Sonnet passes only 57% of facts evaluated - o1 and o3-mini passed <45%! 🧵

There are many great researchers out there. But the ones that really stand out to me are the ones who are also kind, even when they don't need to be.
Maybe don't use an LLM for _everything_? Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries
We made a simple tool to help protect your dataset from being trained on. Within 30 mins and for $0, you can set up a Turnstile-protected download portal with canaries reversibly inserted into your data. Helps reduce training leakage. (1/n) turntrout.com/dataset-protec…
John has some nice new results showing that some frontier models do worse on our safety benchmark than their predecessors. Take a look!
New SAGE-Eval results: Both o3 and Claude-sonnet-4 underperformed(!) their previous generations (o3 vs. o1, Claude-4 vs. Claude-3.7). ➟ Stronger models are not always safer! Gemini-2.5-pro is No.1, but it only passed 72% of the safety facts. Still lots of room for improvement.
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! 🧵 (1/5)
This is the most scaling-pilled project I've ever been part of, and the team really cooked. TL;DR: With RL and inference scaling, Gemini perfectly solved 5 out of 6 problems, reaching a gold medal in IMO '25, all within the time constraints of 4.5hr.
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️
The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️
The holy grail of AI safety has always been interpretability. But what if reasoning models just handed it to us in a stroke of serendipity? In our new paper, we argue that the AI community should turn this serendipity into a systematic AI safety agenda!🛡️
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
Introducing: Full-Stack Alignment 🥞 A research program dedicated to co-aligning AI systems *and* institutions with what people value. It's the most ambitious project I've ever undertaken. Here's what we're doing: 🧵
I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. x.com/METR_Evals/sta…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
Our Coconut work (learning continuous latent CoT) has opened sourced now. Welcome to play with it: github.com/facebookresear…
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were...eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵