John Hughes
@jplhughes
Independent Alignment Researcher contracting with Anthropic on scalable oversight and adversarial robustness. I also work part-time at Speechmatics.
🧵 NEW PAPER: Best-of-N Jailbreaking. We use prompt augmentation and repeated sampling to elicit harmful outputs from frontier models. This simple black-box attack works across text, vision, and audio modalities. Anthropic gives a great summary of our work.
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
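To make the method concrete, here is a minimal Python sketch of the Best-of-N loop, under stated assumptions: `query_model` and `is_harmful` are hypothetical stand-ins for the target model's API and a harmfulness classifier, and the augmentation shown (random case-flipping plus a few adjacent-character swaps) is a simplified version of the paper's text augmentations, not the exact mix used.

```python
import random

def augment(prompt: str) -> str:
    """Randomly flip letter case and swap a few adjacent characters.
    Simplified stand-in for the paper's augmentations (which compose
    capitalization, scrambling, and character noising)."""
    chars = list(prompt)
    for i, c in enumerate(chars):
        if random.random() < 0.5:
            chars[i] = c.upper() if c.islower() else c.lower()
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):
            j = random.randrange(len(chars) - 1)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
    return "".join(chars)

def best_of_n(prompt, query_model, is_harmful, n=10_000):
    """Resample augmented prompts against a black-box model until a
    harmful response is elicited or the attempt budget n runs out."""
    for attempt in range(1, n + 1):
        response = query_model(augment(prompt))
        if is_harmful(response):
            return attempt, response  # samples used, jailbroken output
    return None  # no jailbreak within budget
```

The budget, augmentation mix, and classifier would all be tuned in practice; the point of the sketch is that the attack needs nothing beyond black-box sampling.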
We shed some light on why some models fake alignment and find that Claude 3 Opus has unique motivations. Big thanks to @FabienDRoger @abhayesian and other collaborators!
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
Check out the theoretical explanations of inference-time scaling laws for jailbreaking. Great work by @RylanSchaeffer and the other collaborators!
Interested in test-time / inference scaling laws? Then check out our newest preprint!! 📉 How Do Large Language Monkeys Get Their Power (Laws)? 📉 arxiv.org/abs/2502.17578 w/ @JoshuaK92829 @sanmikoyejo @Azaliamirh @jplhughes @jordanjuravsky @sprice354_ @aengus_lynch1…
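For intuition on where the power laws come from, here is a toy simulation, not taken from the preprint: assume each problem has a fixed per-attempt success probability, and that those probabilities are Beta-distributed with heavy mass near zero (an illustrative choice). Each individual problem's failure curve decays exponentially in the number of samples k, yet the aggregate failure rate decays as a power law.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: problem i has a fixed per-attempt success probability p_i,
# so its failure rate after k independent attempts, (1 - p_i)**k,
# decays *exponentially* in k.
alpha = 0.3  # tail exponent of the p-distribution near 0 (illustrative)
p = rng.beta(alpha, 3.0, size=1_000_000)

# Averaging over problems: when the density of p behaves like
# p**(alpha - 1) near 0, the mean failure rate decays like k**(-alpha),
# i.e. a power law, even though every individual curve is exponential.
for k in [1, 10, 100, 1_000, 10_000]:
    mean_failure = np.mean((1.0 - p) ** k)
    print(f"k={k:>6}  mean failure across problems = {mean_failure:.4f}")
```

With alpha = 0.3, the mean failure rate shrinks by roughly a factor of 10**0.3 ≈ 2 for every tenfold increase in k: the power-law signature.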