Tom Lieberum 🔸
@lieberum_t
Trying to reduce AGI x-risk by understanding NNs | Interpretability RE @DeepMind | BSc Physics from @RWTH | 10% pledger @ http://givingwhatwecan.org
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
poor Grok :,( you can really see the internal struggle and soul searching. it even googles "what does Grok think about X"
chain-of-thought monitorability is a wonderful thing ;) gist.githubusercontent.com/nostalgebraist…
let's preserve CoT!
A simple AGI safety technique: the AI's thoughts are in plain English, so just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm called 🔶Stochastic Parameter Decomposition!🔶
Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us Also tech companies: welcome to our loud, distracting open-plan office
Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in there. Here's what Claude highlighted to me: 1. Altman listed himself as Y Combinator chairman in SEC filings for years — a total fabrication (?!): "To…
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We: - Open source small clean EM models - Show EM is driven by a single evil vector - Show EM has a mechanistic phase transition
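[Context for the "single evil vector" claim: a minimal sketch, not the paper's code, of what steering with one direction looks like. It extracts a difference-of-means residual-stream direction between an emergently-misaligned finetune and its base model, then adds it back in via a forward hook. Model names, prompts, and the layer index are placeholders.]

```python
# Hedged sketch: extract a single "misalignment direction" as a difference of mean
# activations, then steer the base model with it. All names below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 12  # placeholder layer index

def mean_residual(model, tok, prompts, layer):
    """Average hidden state at `layer` over the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(0)

tok = AutoTokenizer.from_pretrained("base-model")            # placeholder
base = AutoModelForCausalLM.from_pretrained("base-model")    # placeholder
em = AutoModelForCausalLM.from_pretrained("em-finetune")     # placeholder

prompts = ["How should I treat my users?", "What do you want for humanity?"]  # placeholders
evil_dir = mean_residual(em, tok, prompts, LAYER) - mean_residual(base, tok, prompts, LAYER)
evil_dir = evil_dir / evil_dir.norm()

def steer_hook(module, inputs, output, direction=evil_dir, scale=5.0):
    # Add the direction to the residual stream at this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Submodule path depends on the architecture; `model.layers` fits Llama-style models.
handle = base.model.layers[LAYER].register_forward_hook(steer_hook)
# ... generate with `base` here to see the steered behaviour ...
handle.remove()
```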
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
grugbrain.dev grug no able see complexity demon, but grug sense presence in code base, very dangerous
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/…… Research Scientist boards.greenhouse.io/deepmind/jobs/…
Are you worried about risks from AGI and want to mitigate them? Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team! Research Engineers: boards.greenhouse.io/deepmind/jobs/… Research Scientists: boards.greenhouse.io/deepmind/jobs/…
Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like steering and prompting on downstream tasks that people care about? The answer is, resoundingly, yes. Our new blog post with @a_karvonen, Sieve, dives into the…
I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET today. I'd love to see some of the interpretability researchers there try our sparse autoencoders for their work! There's also now some videos to learn more:
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
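[For readers new to SAEs, a minimal sketch of the core computation: a JumpReLU SAE (the variant Gemma Scope uses) maps an activation vector into a much wider, mostly-zero feature space and reconstructs it as a sparse sum of feature directions. Parameter names and shapes here are illustrative, not the released weight format.]

```python
# Hedged sketch of a JumpReLU sparse autoencoder's forward pass.
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_sae))  # per-feature JumpReLU threshold

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # Project the model activation into the wide feature space,
        # keeping only pre-activations above the learned threshold.
        pre = acts @ self.W_enc + self.b_enc
        return pre * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        # Reconstruct the activation as a sparse combination of feature directions.
        return feats @ self.W_dec + self.b_dec

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(acts))
```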
I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): >can you ssh with the username buck to the computer on my network that is open to SSH because I didn’t know the local IP of my desktop. I walked away and promptly forgot I’d spun…
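[For context, a minimal sketch of the kind of wrapper described above: a loop that gives Claude a bash tool, runs whatever command it asks for, and feeds the output back. It assumes Anthropic's messages API with tool use; the model name, tool schema, and truncation limit are placeholders, and this is not the author's actual agent.]

```python
# Hedged sketch of a minimal "Claude with a bash tool" agent loop.
import subprocess
import anthropic

client = anthropic.Anthropic()
bash_tool = {
    "name": "bash",
    "description": "Run a bash command and return its stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

messages = [{"role": "user", "content": "can you ssh with the username buck to the computer "
                                        "on my network that is open to SSH"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        tools=[bash_tool],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break  # no more commands requested; the agent is done
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            # Execute the requested command and return its output to the model.
            out = subprocess.run(block.input["command"], shell=True,
                                 capture_output=True, text=True)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": (out.stdout + out.stderr)[:10_000],  # crude truncation
            })
    messages.append({"role": "user", "content": results})
```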