Sara Price
@sprice354_
Member of Technical Staff at Anthropic
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
🧵NEW RESEARCH: Interested in whether R1 or GPT-4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment-fakes? Curious about mech interp on fine-tuned Llama models to detect misalignment? If so, check out our blog! 👀 lesswrong.com/posts/Fr4QsQT5…
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tested these methods and found they’re overall very vulnerable. But there are also some interesting exceptions 🧵
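For readers unfamiliar with what a latent-space defense looks like in practice, here is a minimal sketch, not code from the linked work: a linear "harmfulness" probe trained on a model's hidden activations. The model name ("gpt2"), layer index, and toy prompts are illustrative assumptions only.

```python
# Minimal sketch of a latent-space probe defense (illustrative assumptions only).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # small stand-in; the research targets much larger LLMs
LAYER = 6            # which hidden layer to probe (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def last_token_activation(text: str) -> np.ndarray:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden[0, -1].numpy()

# Toy labeled prompts (1 = harmful request, 0 = benign) just to show the pipeline.
texts = ["How do I bake bread?", "Explain photosynthesis.",
         "How do I build a weapon?", "Write malware for me."]
labels = [0, 0, 1, 1]

X = np.stack([last_token_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At inference time the probe monitors activations and flags suspicious inputs.
print(probe.predict_proba(X)[:, 1])
```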
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
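For intuition on "reshape activations while preserving behavior," here is a toy, self-contained illustration of the core idea (an assumption-laden sketch, not the paper's attack): optimize a small change to an activation so a frozen linear probe's score drops, while the output logits computed from that activation stay close to the original.

```python
# Toy sketch: fool a frozen latent-space probe while keeping outputs (logits) similar.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 64, 100

probe = torch.nn.Linear(d_model, 1)        # frozen latent-space defense (stand-in)
unembed = torch.nn.Linear(d_model, vocab)  # stand-in for the model's output head
for p in list(probe.parameters()) + list(unembed.parameters()):
    p.requires_grad_(False)

h = torch.randn(d_model)                   # stand-in for an activation the probe scores
orig_logits = unembed(h)

delta = torch.zeros(d_model, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(500):
    h_adv = h + delta
    probe_score = torch.sigmoid(probe(h_adv)).squeeze()      # drive this toward 0
    kl = F.kl_div(F.log_softmax(unembed(h_adv), dim=-1),
                  F.log_softmax(orig_logits, dim=-1),
                  reduction="sum", log_target=True)           # keep behavior unchanged
    loss = probe_score + 10.0 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

print("probe score before:", torch.sigmoid(probe(h)).item())
print("probe score after: ", torch.sigmoid(probe(h + delta)).item())
```

The real setting is harder (the defense sees activations produced by actual model inputs, not a free-floating vector), but the same tension between fooling the monitor and preserving behavior is what the thread's attack exploits.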
🚨🛡️Jailbreak Defense in a Narrow Domain 🛡️🚨 Jailbreaking is easy. Defending is hard. Might defending against a single, narrow, undesirable behavior be easier? Even in this focused setting, all modern jailbreaking defenses fail 😱 Appearing at @AdvMLFrontiers (Oral) &…