Summer Yue
@summeryue0
VP of Research at Scale AI. Prev: RLHF lead on Bard / Gemini, researcher at Google DeepMind / Brain (LaMDA, RL / TFAgents, AlphaChip).
Today is my first day at Meta Superintelligence Labs. I’ll be focusing on alignment and safety, building on my time at Scale Research and SEAL. Grateful to keep working with @alexandr_wang—no one more committed, clear-eyed, or mission-driven. Excited for what’s ahead 🚀
🔍 SEAL and Red Team at @scale_ai present a position paper outlining what we’ve learned from red teaming LLMs so far—what matters, what’s missing, and how model safety fits into broader system safety and monitoring. 🔗 scale.com/research/red_t… 📝 scale.com/blog/rethink-r…
🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers – we wrote a position paper arguing for crucial research priorities in red teaming frontier models, followed by a roadmap toward system-level safety, AI monitoring, and…
👀 OpenAI’s new models are showing some humility:

Hard test (HLE)
• o1: 8% correct | 93% confident (!!)
• o3: 20% | 55%
• o4-mini: 18% | 77%

Easy test (GSM8K)
• o1: 96% correct | 100% confident
• o3: 97% | 84%
• o4-mini: 97% | 99%

o3 stands out for being overall less…
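The gap the tweet is pointing at is stated confidence minus accuracy: positive means overconfident, negative means underconfident. A minimal sketch using only the figures quoted above (this is illustrative, not Scale's actual evaluation code):

```python
# Overconfidence gap = stated confidence - accuracy, in percentage points.
# Figures are copied from the tweet above; this is an illustrative sketch,
# not the actual evaluation pipeline.

results = {
    # benchmark: {model: (accuracy_pct, confidence_pct)}
    "HLE":   {"o1": (8, 93),   "o3": (20, 55), "o4-mini": (18, 77)},
    "GSM8K": {"o1": (96, 100), "o3": (97, 84), "o4-mini": (97, 99)},
}

def overconfidence(acc: int, conf: int) -> int:
    """Positive = overconfident, negative = underconfident."""
    return conf - acc

for bench, models in results.items():
    for model, (acc, conf) in models.items():
        print(f"{bench:6s} {model:8s} gap = {overconfidence(acc, conf):+d} pts")
```

On these numbers, o1 on HLE is the outlier (+85 points overconfident), while o3 on GSM8K is actually underconfident (-13 points), which is the "humility" the tweet highlights.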
🤖 AI agents are crossing into the real world. But when they act independently—who’s watching? At Scale, we’re building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We’re hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.…
If a model lies when pressured—it’s not ready for AGI. The new MASK leaderboard is live. Built on the private split of our open-source honesty benchmark (w/ @ai_risks), it tests whether models lie under pressure—even when they know better. 📊 Leaderboard:…

Impressive results! Congrats to the Gemini team.
🚨 Gemini 2.5 Pro Exp dropped and it's now #1 across SEAL leaderboards:
🥇 Humanity’s Last Exam
🥇 VISTA (multimodal)
🥇 (tie) Tool Use
🥇 (tie) MultiChallenge (multi-turn)
🥉 (tie) Enigma (puzzles)
Congrats to @demishassabis @sundarpichai & team!
🔗 scale.com/leaderboard
GPT-4.5 Preview just dropped~ We put it to the test, and the results are... mixed 👀
⚡ #2 in Tool Use - Chat (trailing o1)
🏢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind…

On the heels of Humanity's Last Exam, @scale_AI & @ai_risks have released EnigmaEval, a new very-hard reasoning eval: 1,184 multimodal puzzles so hard that groups of humans take many hours to days to solve them. All top models score 0 on the Hard set, and <10% on the Normal set 🧵
Excited to share our latest work “Jailbreaking to Jailbreak (J2)”, from the SEAL team and @scale_AI's Red Team! As frontier models become more creative and capable of reasoning, they can now not only assist human red teamers but also autonomously drive red teaming efforts.…
🧵 1/N) Excited to share our recent work at @scale_AI, "Jailbreaking to Jailbreak (J2)".😈 We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to this process as…
✨All Gemini 2.0 models are now on MultiChallenge! Pro Experimental, Flash, and Flash Thinking have joined the benchmark - with Pro Experimental ranking #3! 🎯
Introducing MultiChallenge by @scale_AI - a new multi-turn conversation benchmark. Current frontier LLMs score under 50% accuracy (top: 44.93%). The new Gemini 2.0 Flash model launched today has also been added to our SEAL leaderboard. 📄 Paper: arxiv.org/abs/2501.17399…