Summer Yue
@summeryue0
VP of Research at Scale AI. Prev: RLHF lead on Bard / Gemini, researcher at Google DeepMind / Brain (LaMDA, RL / TFAgents, AlphaChip).
Today is my first day at Meta Superintelligence Labs. I’ll be focusing on alignment and safety, building on my time at Scale Research and SEAL. Grateful to keep working with @alexandr_wang—no one more committed, clear-eyed, or mission-driven. Excited for what’s ahead 🚀
🔍 SEAL and Red Team at @scale_ai present a position paper outlining what we’ve learned from red teaming LLMs so far—what matters, what’s missing, and how model safety fits into broader system safety and monitoring. 🔗 scale.com/research/red_t… 📝 scale.com/blog/rethink-r…
🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers – we wrote a position paper arguing for crucial research priorities in red teaming frontier models, followed by a roadmap toward system-level safety, AI monitoring, and…
👀 OpenAI’s new models are showing some humility:

Hard test (HLE)
• o1: 8% correct | 93% confident (!!)
• o3: 20% | 55%
• o4-mini: 18% | 77%

Easy test (GSM8K)
• o1: 96% correct | 100% confident
• o3: 97% | 84%
• o4-mini: 97% | 99%

o3 stands out for being overall less…
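The gap the tweet is pointing at is stated confidence minus accuracy: positive means overconfident, negative means underconfident. A minimal sketch using only the figures quoted above (this is illustrative, not Scale's actual evaluation code):

```python
# Overconfidence gap = stated confidence - accuracy, in percentage points.
# Figures are copied from the tweet above; this is an illustrative sketch,
# not the actual evaluation pipeline.

results = {
    # benchmark: {model: (accuracy_pct, confidence_pct)}
    "HLE":   {"o1": (8, 93),   "o3": (20, 55), "o4-mini": (18, 77)},
    "GSM8K": {"o1": (96, 100), "o3": (97, 84), "o4-mini": (97, 99)},
}

def overconfidence(acc: int, conf: int) -> int:
    """Positive = overconfident, negative = underconfident."""
    return conf - acc

for bench, models in results.items():
    for model, (acc, conf) in models.items():
        print(f"{bench:6s} {model:8s} gap = {overconfidence(acc, conf):+d} pts")
```

On these numbers, o1 on HLE is the outlier (+85 points overconfident), while o3 on GSM8K is actually underconfident (-13 points), which is the "humility" the tweet highlights.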
🤖 AI agents are crossing into the real world. But when they act independently—who’s watching? At Scale, we’re building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We’re hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.…
If a model lies when pressured—it’s not ready for AGI. The new MASK leaderboard is live. Built on the private split of our open-source honesty benchmark (w/ @ai_risks), it tests whether models lie under pressure—even when they know better. 📊 Leaderboard:…

Impressive results! Congrats to the Gemini team.
🚨 Gemini 2.5 Pro Exp dropped and it's now #1 across SEAL leaderboards:
🥇 Humanity’s Last Exam
🥇 VISTA (multimodal)
🥇 (tie) Tool Use
🥇 (tie) MultiChallenge (multi-turn)
🥉 (tie) Enigma (puzzles)
Congrats to @demishassabis @sundarpichai & team!
🔗 scale.com/leaderboard
GPT-4.5 Preview just dropped~ We put it to the test, and the results are... mixed 👀
⚡ #2 in Tool Use - Chat (trailing o1)
🏢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind…

On the heels of Humanity's Last Exam, @scale_AI & @ai_risks have released EnigmaEval, a new very-hard reasoning eval: 1,184 multimodal puzzles so hard that groups of humans take many hours to days to solve them. All top models score 0 on the Hard set, and <10% on the Normal set 🧵
Excited to share our latest work “Jailbreaking to Jailbreak (J2)”, from the SEAL team and @scale_AI's Red Team! As frontier models become more creative and capable of reasoning, they can now not only assist human red teamers but also autonomously drive red teaming efforts.…
🧵 1/N) Excited to share our recent work at @scale_AI, "Jailbreaking to Jailbreak (J2)".😈 We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to this process as…
✨All Gemini 2.0 models are now on MultiChallenge! Pro Experimental, Flash, and Flash Thinking have joined the benchmark - with Pro Experimental ranking #3! 🎯
Introducing MultiChallenge by @scale_AI - a new multi-turn conversation benchmark. Current frontier LLMs score under 50% accuracy (top: 44.93%). The new Gemini 2.0 Flash model launched today has also been added to our SEAL leaderboard. 📄 Paper: arxiv.org/abs/2501.17399…