Samuel Marks
@saprmarks
AI safety research @AnthropicAI. Prev postdoc in LLM interpretability with @davidbau, math PhD at @Harvard, director of technical programs at http://haist.ai
New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Some implications:
* emergent misalignment experiments should not use the same model for data generation and fine-tuning
* filtering + re-distillation is probably worse than you might think
* data poisoning from insiders may be hard to catch (unclear how viable this is)
I'm excited to discuss downstream applications of interpretability at @ActInterp! For a preview of my thoughts on the topic, see my blog post on how I think about picking applications to target: x.com/saprmarks/stat…
🚨Meet our panelists at the Actionable Interpretability Workshop @ActInterp at @icmlconf! Join us July 19 at 4pm for a panel on making interpretability research actionable, its challenges, and how the community can drive greater impact. @nsaphra @saprmarks @kylelostat @FazlBarez
Modern reasoning models think in plain English. Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems. I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.