Samuel Marks
@saprmarks
AI safety research @AnthropicAI. Prev postdoc in LLM interpretability with @davidbau, math PhD at @Harvard, director of technical programs at http://haist.ai
New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Some implications:
* emergent misalignment experiments should not use the same model for data generation and fine-tuning
* filtering + re-distillation is probably worse than you might think
* data poisoning from insiders may be hard to catch (unclear how viable this is)
I'm excited to discuss downstream applications of interpretability at @ActInterp! For a preview of my thoughts on the topic, see my blog post on how I think about picking applications to target: x.com/saprmarks/stat…
🚨Meet our panelists at the Actionable Interpretability Workshop @ActInterp at @icmlconf! Join us July 19 at 4pm for a panel on making interpretability research actionable, its challenges, and how the community can drive greater impact. @nsaphra @saprmarks @kylelostat @FazlBarez
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.