Yanda Chen

@yanda_chen_

Member of Technical Staff @AnthropicAI | PhD @ColumbiaCompSci | NLP & ML | Prev Intern @MSFTResearch, @AmazonScience

San Francisco, CA

Joined January 2019

598Following

2KFollowers

Pinned

Yanda Chen@yanda_chen_ · Apr 3

My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏…

AAnthropic@AnthropicAI · Apr 3

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

1.0K

365

86.0K

Pinned

Yanda Chen@yanda_chen_ · Dec 18

Claude 3 Opus can be made to fake alignment when it thinks it’s going to be RLHF’d. This happens under realistic circumstances, which we can induce through either prompting or SFT.

AAnthropic@AnthropicAI · Dec 18

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored. When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.

5.0K

Yanda Chen Retweeted

Aryo Pradipta Gema@aryopg · Jul 22

New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵

162

1.0K

604

206.0K

Yanda Chen@yanda_chen_ · Jun 16

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

AAnthropic@AnthropicAI · Jun 16

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

10.0K

Yanda Chen@yanda_chen_ · Mar 27

My favorite figure from our new Circuits papers -- "How does Claude do math?" Claude simultaneously does: 1. a back of the envelope calculation of the tens digits -- "the answer should be somewhere around 90". 2. an exact calculation of 6+9=15 using these super cool look up…

AAnthropic@AnthropicAI · Mar 27

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

119

1.0K

736

126.0K

Yanda Chen@yanda_chen_ · Mar 27

Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.

AAnthropic@AnthropicAI · Mar 27

201

13.0K

Yanda Chen Retweeted

Johannes Gasteiger, né Klicpera@gasteigerjo · Mar 25

New Anthropic blog post: Subtle sabotage in automated researchers. As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.

300

166

43.0K

Yanda Chen@yanda_chen_ · Mar 13

Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min

AAnthropic@AnthropicAI · Mar 13

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

449

51.0K

Yanda Chen@yanda_chen_ · Feb 25

I’m excited this work is out! We show how risks grow with more deployment usage using small evaluation sets (thoughts and takeaways in 🧵)

AAnthropic@AnthropicAI · Feb 25

New Anthropic research: Forecasting rare language model behaviors. We forecast whether risks will occur after a model is deployed—using even very limited sets of test data.

2.0K

Yanda Chen@yanda_chen_ · Feb 27

More work from the Safeguards Research Team: Catching model misalignment and novel model misuse in the wild is valuable. Summarization lets us *scale* monitoring and make much more efficient of human resources.

AAnthropic@AnthropicAI · Feb 27

New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use computers. Hierarchical summarization helps differentiate between normal uses of the capability like UI testing—and for example, running a click farm to defraud advertisers.

2.0K

Yanda Chen@yanda_chen_ · Feb 13

Results of our jailbreaking challenge: After 5 days, >300,000 messages, and est. 3,700 collective hours our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!

AAnthropic@AnthropicAI · Feb 3

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

105

126

2.0K

379

554.0K

Yanda Chen@yanda_chen_ · Feb 3

After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here: claude.ai/constitutional…

AAnthropic@AnthropicAI · Feb 3

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

138

18.0K

Yanda Chen Retweeted

Jan Leike@janleike · Feb 3

We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional…

382

265

4.0K

2.0K

1.3M

Yanda Chen Retweeted

Anthropic@AnthropicAI · Feb 3

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

333

301

2.0K

856

1.3M

Yanda Chen@yanda_chen_ · Dec 26

DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M). For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being…

DDeepSeek@deepseek_ai · Dec 26

🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n

409

2.0K

19.0K

8.0K

6.5M

Yanda Chen@yanda_chen_ · Dec 18

Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training

AAnthropic@AnthropicAI · Dec 18

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

877

271

93.0K

Yanda Chen@yanda_chen_ · Dec 13

I’m really excited that this is out! Best-of-N jailbreaking shows you don’t need advanced techniques to get harmful outputs from current LLMs. you can instead use augmentations and scale up compute Clearly more work needed on adversarial robustness!

AAnthropic@AnthropicAI · Dec 13

New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

1.0K

Yanda Chen Retweeted

Ruiqi Zhong@ZhongRuiqi · Dec 10

I'm on the academic job market! I build AI systems that assist humans in complicated tasks (e.g. pattern discovery/automate software development), and focus on cases when their outputs are hard-to-explain or evaluate. I'll be at NeurIPS'24 from 12/10-12/15. Happy to catch up!

206

24.0K

Yanda Chen@yanda_chen_ · Dec 2

i'll be supervising some fellows on this program on projects related to adversarial robustness! apply below, and feel free to dm me :-)

AAnthropic@AnthropicAI · Dec 2

We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.

5.0K

Yanda Chen@yanda_chen_ · Dec 3

Many of our best papers have come through collaborations with academics and people transitioning into AI safety researchers from outside Anthropic. Very excited that we are expanding our collaborations here - come apply to work with us!

AAnthropic@AnthropicAI · Dec 2

3.0K