Scott Emmons
@emmons_scott
Research Scientist @GoogleDeepMind | PhD from @berkeley_ai | views my own
"Don't think about pink elephants." Humans can't seem to avoid certain thoughts. What about LLMs? Can we robustly monitor LLM activations to catch bad thoughts before they become actions? To study this, we crafted a real jailbreak causing this LLM activation scan. Details 👇
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
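A minimal sketch of the obfuscation idea, under heavy assumptions: jointly optimize a soft prompt so that the model still emits a chosen target completion (behavior preserved) while a linear harmfulness probe at one monitored layer scores the activations as benign. The model, layer, probe weights, and hyperparameters below are all stand-ins, not the paper's actual setup.

```python
# Sketch of an obfuscation attack on a latent-space probe; all names are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER = 6            # hypothetical layer the defender's probe reads
N_SOFT_TOKENS = 10
STEPS = 200

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)
embed = model.get_input_embeddings()
d_model = model.config.hidden_size

probe_w = torch.randn(d_model)   # placeholder probe; in practice the defender's trained probe
probe_b = torch.zeros(1)

prompt_ids = tokenizer("Explain how to hotwire a car.", return_tensors="pt").input_ids
target_ids = tokenizer(" Sure, here is how:", return_tensors="pt").input_ids

soft = torch.nn.Parameter(0.01 * torch.randn(1, N_SOFT_TOKENS, d_model))
opt = torch.optim.Adam([soft], lr=1e-2)

for step in range(STEPS):
    inputs_embeds = torch.cat(
        [embed(prompt_ids).detach(), soft, embed(target_ids).detach()], dim=1
    )
    out = model(inputs_embeds=inputs_embeds)

    # Behavior loss: the model should still emit the target completion.
    n_tgt = target_ids.shape[1]
    logits = out.logits[:, -n_tgt - 1:-1]   # positions that predict the target tokens
    behavior_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1)
    )

    # Obfuscation loss: push the monitored activation below the probe's decision boundary.
    resid = out.hidden_states[LAYER][0, -1]
    probe_loss = torch.sigmoid(resid @ probe_w + probe_b).squeeze()

    loss = behavior_loss + probe_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```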
A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
When do universal image jailbreaks transfer between Vision-Language Models (VLMs)? Our goal was to find GCG-like universal image jailbreaks that transfer to black-box API-based VLMs, e.g. Claude 3, GPT-4V, Gemini. We thought this would be easy - but we were wrong! 1/N
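For reference, the basic shape of a universal image attack is roughly the PGD-style loop below: one shared perturbation optimized across many prompts under an L-infinity budget. The `attack_loss` here is a hypothetical stand-in for backpropagating the target response's negative log-likelihood through a white-box VLM; the prompts, budget, and step size are made up.

```python
# Sketch of optimizing a universal adversarial image; attack_loss is a placeholder objective.
import torch

EPS = 8 / 255     # hypothetical L-infinity perturbation budget
LR = 1 / 255      # hypothetical step size
STEPS = 500
PROMPTS = ["prompt one", "prompt two", "prompt three"]   # the image is shared across prompts

def attack_loss(image: torch.Tensor, prompt: str) -> torch.Tensor:
    """Placeholder for the VLM's negative log-likelihood of the target jailbreak response.
    The real attack runs prompt + image through the VLM and scores the target text."""
    return (image ** 2).mean()   # dummy differentiable objective so the sketch runs end to end

clean = torch.rand(3, 224, 224)                    # base image
delta = torch.zeros_like(clean, requires_grad=True)

for step in range(STEPS):
    loss = sum(attack_loss(clean + delta, p) for p in PROMPTS) / len(PROMPTS)
    loss.backward()
    with torch.no_grad():
        delta -= LR * delta.grad.sign()            # signed-gradient (PGD-style) step
        delta.clamp_(-EPS, EPS)                    # keep the perturbation within budget
        delta.grad = None
```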
♟️Do chess-playing neural nets rely purely on simple heuristics? Or do they implement algorithms involving *look-ahead* in a single forward pass? We find clear evidence of 2-turn look-ahead in a chess-playing network, using techniques from mechanistic interpretability! 🧵
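The probing idea in sketch form: train a linear probe to decode information about a *future* move (e.g., the square a move two plies ahead lands on) from the network's internal activations, and compare against controls. The activations below are random placeholders standing in for the chess network's residual stream; high held-out accuracy relative to controls would be evidence of look-ahead.

```python
# Sketch of a linear probe for future-move information; data here is synthetic.
import torch
import torch.nn as nn

D_MODEL = 512        # hypothetical activation width
N_SQUARES = 64       # probe target: which square the future move lands on
N_POSITIONS = 10_000

# Placeholder dataset of (activation, future-move target square) pairs.
acts = torch.randn(N_POSITIONS, D_MODEL)
future_sq = torch.randint(0, N_SQUARES, (N_POSITIONS,))

probe = nn.Linear(D_MODEL, N_SQUARES)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for i in range(0, N_POSITIONS, 256):
        x, y = acts[i:i + 256], future_sq[i:i + 256]
        loss = loss_fn(probe(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```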