Yanda Chen
@yanda_chen_
Member of Technical Staff @AnthropicAI | PhD @ColumbiaCompSci | NLP & ML | Prev Intern @MSFTResearch, @AmazonScience
My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏…
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
Claude 3 Opus can be made to fake alignment when it thinks it’s going to be RLHF’d. This happens under realistic circumstances, which we can induce through either prompting or SFT.
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored. When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.
New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.
My favorite figure from our new Circuits papers -- "How does Claude do math?" Claude simultaneously does: 1. a back of the envelope calculation of the tens digits -- "the answer should be somewhere around 90". 2. an exact calculation of 6+9=15 using these super cool look up…
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
New Anthropic blog post: Subtle sabotage in automated researchers. As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
I’m excited this work is out! We show how risks grow with more deployment usage using small evaluation sets (thoughts and takeaways in 🧵)
New Anthropic research: Forecasting rare language model behaviors. We forecast whether risks will occur after a model is deployed—using even very limited sets of test data.
More work from the Safeguards Research Team: Catching model misalignment and novel model misuse in the wild is valuable. Summarization lets us *scale* monitoring and make much more efficient of human resources.
New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use computers. Hierarchical summarization helps differentiate between normal uses of the capability like UI testing—and for example, running a click farm to defraud advertisers.
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and est. 3,700 collective hours our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here: claude.ai/constitutional…
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional…
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M). For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being…
🚀 Introducing DeepSeek-V3! Biggest leap forward yet: ⚡ 60 tokens/second (3x faster than V2!) 💪 Enhanced capabilities 🛠 API compatibility intact 🌍 Fully open-source models & papers 🐋 1/n
Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
I’m really excited that this is out! Best-of-N jailbreaking shows you don’t need advanced techniques to get harmful outputs from current LLMs. you can instead use augmentations and scale up compute Clearly more work needed on adversarial robustness!
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
I'm on the academic job market! I build AI systems that assist humans in complicated tasks (e.g. pattern discovery/automate software development), and focus on cases when their outputs are hard-to-explain or evaluate. I'll be at NeurIPS'24 from 12/10-12/15. Happy to catch up!
i'll be supervising some fellows on this program on projects related to adversarial robustness! apply below, and feel free to dm me :-)
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.
Many of our best papers have come through collaborations with academics and people transitioning into AI safety researchers from outside Anthropic. Very excited that we are expanding our collaborations here - come apply to work with us!
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.