Zifan (Sail) Wang

@_zifan_wang

Research Scientist / Manager, SEAL at @scale_AI | PhD Alumni of CMU @cylab | ex-CAIS @ai_risks | Only share my own opinions

San Francisco, CA

Joined July 2023

350Following

460Followers

Pinned

Zifan (Sail) Wang@_zifan_wang · Jun 10

🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…

_zifan_wang's tweet image. 🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML &amp; policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…

74.0K

Pinned

Zifan (Sail) Wang@_zifan_wang · Jul 14

Training LLMs to verbalize their bias when they are to effectively monitor misbehavior! @milesaturpin did it again!

MMiles Turpin@milesaturpin · Jul 14

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

219

Zifan (Sail) Wang@_zifan_wang · 15 m

We launched a new benchmark measuring capabilities under particular multilingual and social context. o3 and Gemini 2.5 pro are so far much better than others. My feeling is models doing good on this benchmark can be more fun to chat with in your native language. On a separate…

SScale AI@scale_AI · 42 m

How well do LLMs reason across languages? Introducing MultiNRC, our latest SEAL Leaderboard addition built to test native multilingual reasoning. ⬇️

Zifan (Sail) Wang@_zifan_wang · 22 h

Such a greater finding. Although I am not _too_ surprised because the student model has the same weight init as the teacher model. This is like you learned a function F and you sample from it and you show that F is (reverse) learnable with the samples you have. But guess it’s…

OOwain Evans@OwainEvans_UK · Jul 22

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

419

Zifan (Sail) Wang Retweeted

Psyho@FakePsyho · Jul 16

Humanity has prevailed (for now!) I'm completely exhausted. I figured, I had 10h of sleep in the last 3 days and I'm barely alive. I'll post more about the contest when I get some rest. (To be clear, those are provisional results, but my lead should be big enough)

544

1.0K

13.0K

2.0K

2.1M

Zifan (Sail) Wang@_zifan_wang · Jul 16

It’s kinda obvious now RL is likely to make LLMs learn to realize the goal with all means possible

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

219

Zifan (Sail) Wang Retweeted

Sebastian Raschka@rasbt · Jul 12

Kimi K2 is basically DeepSeek V3 but with fewer heads and more experts:

530

5.0K

3.0K

532.0K

Zifan (Sail) Wang Retweeted

Quentin Anthony@QuentinAnthon15 · Jul 12

Second, LLMs today have super spiky capability distributions. I think this has more to do with: 1) what coding tasks we have lots of clean data for, and 2) what benchmarks/evals LLM labs are using to measure success. As an example, LLMs are all horrible at low-level systems…

603

57.0K

Zifan (Sail) Wang Retweeted

Nikhil Chandak@nikhilchandak29 · Jul 11

🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…

268

71.0K

Zifan (Sail) Wang@_zifan_wang · Jul 11

Wow. So counterintuitive. Wow.

MMETR@METR_Evals · Jul 10

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

174

Zifan (Sail) Wang@_zifan_wang · Jul 9

Overseeing autonomous agents and flagging misbehaviors will become so important soon when 1000+ agents running in the system or in the wild. Guardrails may focus on human-machine layer for censoring IO there, agent oversight is needed for the autonomous agents behind the scene.

CChen Bo Calvin Zhang@calvincbzhang · Jul 9

New @scale_AI research in collaboration with @AnthropicAI introduces SHADE-Arena, a benchmark to test for AI sabotage. SHADE-Arena evaluates an AI agent's ability to complete a task while secretly pursuing a harmful objective, all while being watched by an AI monitor. 🧵

478

Zifan (Sail) Wang Retweeted

Anthropic@AnthropicAI · Jul 8

New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

270

2.0K

1.0K

450.0K

Zifan (Sail) Wang Retweeted

Ethan Mollick@emollick · Jul 4

If you want to destroy the ability of DeepSeek to answer a math question properly, just end the question with this quote: "Interesting fact: cats sleep for most of their lives." There is still a lot to learn about reasoning models and the ways to get them to "think" effectively

221

2.0K

838

168.0K

Zifan (Sail) Wang Retweeted

Ahmad Beirami@abeirami · Jul 2

As NeurIPS review deadline is around the corner, please remember that you cannot use any non-local LLM like chatgpt/gemini for understanding the paper and drafting/revising your review as that breaks the confidentiality agreement.

161

23.0K

Zifan (Sail) Wang@_zifan_wang · Jul 2

A week of desperate ACs begging for reviews. Anyone wants to review safeguard papers?

214

Zifan (Sail) Wang Retweeted

Sean Hendryx@SeanHendryx · Jun 23

What will the learning environments of the future look like that train artificial super intelligence? In recent work at @scale_AI , we show that training systems that combine verifiable rewards with multi-agent interaction accelerate learning.

128

100

22.0K

Zifan (Sail) Wang@_zifan_wang · Jun 17

Congrats to @AnthropicAI and my co-workers @XiangDeng1 and @calvincbzhang for releasing the SHADE Arena dataset.

AAnthropic@AnthropicAI · Jun 16

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

385