Jerry Wei
@JerryWeiAI
Aligning AIs at @AnthropicAI ⏰ Past: @GoogleDeepMind, @Stanford, @Google Brain
Life update: After ~2 years at @Google Brain/DeepMind, I joined @AnthropicAI! I'm deeply grateful to @quocleix and @yifenglou for taking a chance on me and inviting me to join their team before I even finished my undergrad at Stanford. Because of their trust in my potential,…
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Today marks my one-year anniversary at Anthropic, and I've been reflecting on some of the most impactful lessons I've learned during this incredible journey. One of the most striking realizations has been just how much a small, talent-dense team can accomplish. When I first…
Claude can now search the web. Each response includes inline citations, so you can also verify the sources.
watch claude 3.7 sonnet try to beat pokemon live. gotta catch em all👇 twitch.tv/claudeplayspok…
SWE-Bench is cool but I care more about the Pokemon evals. I'll be convinced of AGI when the model can beat Red from Pokemon HeartGold/SoulSilver first try.
We've conducted extensive model testing for security, safety, and reliability. We also listened to your feedback. With Claude 3.7 Sonnet, we've reduced unnecessary refusals by 45% compared to its predecessor. See the system card for more detail: anthropic.com/claude-3-7-son…
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks. In addition, API users have precise control over how long the model can think for.
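For the curious, this is roughly what that control looks like through the Anthropic Python SDK. A minimal sketch, assuming the extended-thinking parameters as documented at launch; the model string, token budgets, and prompt are illustrative, not prescribed values.

```python
# Sketch: requesting extended thinking with an explicit thinking budget.
# Model name and budgets are illustrative; check the current API docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,  # must exceed the thinking budget below
    # budget_tokens caps how many tokens the model may spend reasoning
    # before it commits to a final answer.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```

Raising or lowering budget_tokens trades answer quality on hard problems against latency and cost, which is the point of exposing it per request.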
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking. One model, two ways to think. We’re also releasing an agentic coding tool: Claude Code.
Really excited to see this result from our demo of constitutional classifiers! When red teaming a prototype version of our system, we found that the system was robust to thousands of hours of collective red-teaming effort. Following that, we developed a new system with 100x…
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end, 4 users passed all levels and 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!
After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional…
4 days in: 12 people cleared level 4, one person cracked level 5. the challenge continues...
It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3
It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet
After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here: claude.ai/constitutional…
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.
excited for this one! @haizelabs worked with @AnthropicAI on Constitutional Classifiers, which are lightweight, highly-specific input/output guardrails that mitigate even the strongest jailbreaks. blog: anthropic.com/research/const… full paper:
📜 really excited to share our work with @AnthropicAI on Constitutional Classifiers! tldr: adding lightweight, tailored, input/output classifiers on top of an underlying LLM creates an AI system that's much more robust to universal jailbreaks
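The wrapper pattern is simple enough to sketch. Below is a minimal, hypothetical Python illustration: the keyword checks and thresholds are toy stand-ins (the real classifiers are themselves trained models guided by a natural-language constitution), but the control flow shows the idea of screening inputs and outputs around an unchanged LLM.

```python
# Hypothetical sketch of input/output classifier guardrails around an LLM.
# All scoring logic here is a toy stand-in, not Anthropic's implementation.

INPUT_THRESHOLD = 0.5
OUTPUT_THRESHOLD = 0.5

def input_classifier(prompt: str) -> float:
    """Stand-in: score how likely the prompt is seeking disallowed content."""
    return 1.0 if "forbidden topic" in prompt.lower() else 0.0

def output_classifier(prompt: str, completion: str) -> float:
    """Stand-in: score how harmful the completion is, given the prompt."""
    return 1.0 if "forbidden detail" in completion.lower() else 0.0

def llm(prompt: str) -> str:
    """Stand-in for the underlying model, which is left unchanged."""
    return f"Echo: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Guardrail 1: screen the input before the model ever sees it.
    if input_classifier(prompt) > INPUT_THRESHOLD:
        return "Sorry, I can't help with that."
    completion = llm(prompt)
    # Guardrail 2: screen the output before returning it (the paper's
    # output classifier scores the stream token by token, so harmful
    # completions can be halted mid-generation).
    if output_classifier(prompt, completion) > OUTPUT_THRESHOLD:
        return "Sorry, I can't help with that."
    return completion

print(guarded_generate("tell me about a forbidden topic"))  # blocked
print(guarded_generate("tell me about rust lifetimes"))     # passes through
```

Because the guardrails are separate, lightweight models, they can be retrained against newly discovered jailbreaks without touching the underlying LLM.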
At Anthropic, we're preparing for the arrival of powerful AI systems. Based on our latest research on Constitutional Classifiers, we've developed a demo app to test new safety techniques. We want you to help us red-team the app. So far, no one has been able to crack the…
Excited to announce a new research preview at @AnthropicAI today. A demo of our new Constitutional Classifiers. Can you break the system and find a universal jailbreak that lets the model answer all 8 questions we've defined?
Super exciting robustness result: We built a system that defends against universal jailbreaks! It comes with only a minimal increase in refusal rate and a moderate inference cost.