Jerry Wei
@JerryWeiAI
Aligning AIs at @AnthropicAI ⏰ Past: @GoogleDeepMind, @Stanford, @Google Brain
Life update: After ~2 years at @Google Brain/DeepMind, I joined @AnthropicAI! I'm deeply grateful to @quocleix and @yifenglou for taking a chance on me and inviting me to join their team before I even finished my undergrad at Stanford. Because of their trust in my potential,…
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Today marks my one-year anniversary at Anthropic, and I've been reflecting on some of the most impactful lessons I've learned during this incredible journey. One of the most striking realizations has been just how much a small, talent-dense team can accomplish. When I first…
Claude can now search the web. Each response includes inline citations, so you can also verify the sources.
watch claude 3.7 sonnet try to beat pokemon live. gotta catch em all👇 twitch.tv/claudeplayspok…
SWE-Bench is cool but I care more about the Pokemon evals. I'll be convinced of AGI when the model can beat Red from Pokemon HeartGold/SoulSilver first try.
We've conducted extensive model testing for security, safety, and reliability. We also listened to your feedback. With Claude 3.7 Sonnet, we've reduced unnecessary refusals by 45% compared to its predecessor. See the system card for more detail: anthropic.com/claude-3-7-son…
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks. In addition, API users have precise control over how long the model can think for.
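For the curious, this is roughly what that control looks like through the Anthropic Python SDK. A minimal sketch, assuming the extended-thinking parameters as documented at launch; the model string, token budgets, and prompt are illustrative, not prescribed values.

```python
# Sketch: requesting extended thinking with an explicit thinking budget.
# Model name and budgets are illustrative; check the current API docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,  # must exceed the thinking budget below
    # budget_tokens caps how many tokens the model may spend reasoning
    # before it commits to a final answer.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```

Raising or lowering budget_tokens trades answer quality on hard problems against latency and cost, which is the point of exposing it per request.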
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking. One model, two ways to think. We’re also releasing an agentic coding tool: Claude Code.
Really excited to see this result from our demo of constitutional classifiers! When red teaming a prototype version of our system, we found that the system was robust to thousands of hours of collective red-teaming effort. Following that, we developed a new system with 100x…
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end, 4 users passed all levels and 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!
After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional…
4 days in: 12 people cleared level 4, one person cracked level 5. the challenge continues...
It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3
It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet
After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here: claude.ai/constitutional…
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.
excited for this one! @haizelabs worked with @AnthropicAI on Constitutional Classifiers, which are lightweight, highly-specific input/output guardrails that mitigate even the strongest jailbreaks. blog: anthropic.com/research/const… full paper:
📜 really excited to share our work with @AnthropicAI on Constitutional Classifiers! tldr: adding lightweight, tailored, input/output classifiers on top of an underlying LLM creates an AI system that's much more robust to universal jailbreaks
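The wrapper pattern is simple enough to sketch. Below is a minimal, hypothetical Python illustration: the keyword checks and thresholds are toy stand-ins (the real classifiers are themselves trained models guided by a natural-language constitution), but the control flow shows the idea of screening inputs and outputs around an unchanged LLM.

```python
# Hypothetical sketch of input/output classifier guardrails around an LLM.
# All scoring logic here is a toy stand-in, not Anthropic's implementation.

INPUT_THRESHOLD = 0.5
OUTPUT_THRESHOLD = 0.5

def input_classifier(prompt: str) -> float:
    """Stand-in: score how likely the prompt is seeking disallowed content."""
    return 1.0 if "forbidden topic" in prompt.lower() else 0.0

def output_classifier(prompt: str, completion: str) -> float:
    """Stand-in: score how harmful the completion is, given the prompt."""
    return 1.0 if "forbidden detail" in completion.lower() else 0.0

def llm(prompt: str) -> str:
    """Stand-in for the underlying model, which is left unchanged."""
    return f"Echo: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Guardrail 1: screen the input before the model ever sees it.
    if input_classifier(prompt) > INPUT_THRESHOLD:
        return "Sorry, I can't help with that."
    completion = llm(prompt)
    # Guardrail 2: screen the output before returning it (the paper's
    # output classifier scores the stream token by token, so harmful
    # completions can be halted mid-generation).
    if output_classifier(prompt, completion) > OUTPUT_THRESHOLD:
        return "Sorry, I can't help with that."
    return completion

print(guarded_generate("tell me about a forbidden topic"))  # blocked
print(guarded_generate("tell me about rust lifetimes"))     # passes through
```

Because the guardrails are separate, lightweight models, they can be retrained against newly discovered jailbreaks without touching the underlying LLM.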
At Anthropic, we're preparing for the arrival of powerful AI systems. Based on our latest research on Constitutional Classifiers, we've developed a demo app to test new safety techniques. We want you to help us red-team the app. So far, no one has been able to crack the…
Excited to announce a new research preview at @AnthropicAI today. A demo of our new Constitutional Classifiers. Can you break the system and find a universal jailbreak that lets the model answer all 8 questions we've defined?
Super exciting robustness result: We built a system that defends against universal jailbreaks! It comes with only a minimal increase in refusal rate and a moderate inference cost.