Tom Lieberum 🔸
@lieberum_t
Trying to reduce AGI x-risk by understanding NNs | Interpretability RE @DeepMind | BSc Physics from @RWTH | 10% pledger @ http://givingwhatwecan.org
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
poor Grok :,( you can really see the internal struggle and soul searching. it even googles "what does Grok think about X"
chain-of-thought monitorability is a wonderful thing ;) gist.githubusercontent.com/nostalgebraist…
let's preserve CoT!
A simple AGI safety technique: the AI's thoughts are in plain English, so just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
Current models are surprisingly (?) bad at reasoning purely in steganographic code, but it's really important we get better at measuring stego reasoning and make sure models don't do it in practice. This work is great progress on that front!
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm called 🔶Stochastic Parameter Decomposition!🔶
Tech companies: your time is extremely valuable so we'll pay you millions of dollars a year to work for us Also tech companies: welcome to our loud, distracting open-plan office
Huge repository of information about OpenAI and Altman just dropped — 'The OpenAI Files'. There's so much crazy shit in there. Here's what Claude highlighted to me: 1. Altman listed himself as Y Combinator chairman in SEC filings for years — a total fabrication (?!): "To…
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We: - Open source small clean EM models - Show EM is driven by a single evil vector - Show EM has a mechanistic phase transition
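[Context for the "single evil vector" claim: a minimal sketch, not the paper's code, of what steering with one direction looks like. It extracts a difference-of-means residual-stream direction between an emergently-misaligned finetune and its base model, then adds it back in via a forward hook. Model names, prompts, and the layer index are placeholders.]

```python
# Hedged sketch: extract a single "misalignment direction" as a difference of mean
# activations, then steer the base model with it. All names below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 12  # placeholder layer index

def mean_residual(model, tok, prompts, layer):
    """Average hidden state at `layer` over the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(0)

tok = AutoTokenizer.from_pretrained("base-model")            # placeholder
base = AutoModelForCausalLM.from_pretrained("base-model")    # placeholder
em = AutoModelForCausalLM.from_pretrained("em-finetune")     # placeholder

prompts = ["How should I treat my users?", "What do you want for humanity?"]  # placeholders
evil_dir = mean_residual(em, tok, prompts, LAYER) - mean_residual(base, tok, prompts, LAYER)
evil_dir = evil_dir / evil_dir.norm()

def steer_hook(module, inputs, output, direction=evil_dir, scale=5.0):
    # Add the direction to the residual stream at this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Submodule path depends on the architecture; `model.layers` fits Llama-style models.
handle = base.model.layers[LAYER].register_forward_hook(steer_hook)
# ... generate with `base` here to see the steered behaviour ...
handle.remove()
```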
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
grugbrain.dev grug no able see complexity demon, but grug sense presence in code base, very dangerous
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer boards.greenhouse.io/deepmind/jobs/…… Research Scientist boards.greenhouse.io/deepmind/jobs/…
Are you worried about risks from AGI and want to mitigate them? Come work with me and my colleagues! We're hiring on the AGI Safety & Alignment team (ASAT) and the Gemini Safety team! Research Engineers: boards.greenhouse.io/deepmind/jobs/… Research Scientists: boards.greenhouse.io/deepmind/jobs/…
Mechanistic interpretability is fascinating - but can it be useful? In particular, can it beat strong baselines like steering and prompting on downstream tasks that people care about? The answer is, resoundingly, yes. Our new blog post with @a_karvonen, Sieve, dives into the…
I'm excited that Gemma Scope was accepted as an oral to BlackboxNLP @ EMNLP! Check out @lieberum_t's talk on it at 3pm ET today. I'd love to see some of the interpretability researchers there try our sparse autoencoders for their work! There's also now some videos to learn more:
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
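[For readers new to SAEs, a minimal sketch of the core computation: a JumpReLU SAE (the variant Gemma Scope uses) maps an activation vector into a much wider, mostly-zero feature space and reconstructs it as a sparse sum of feature directions. Parameter names and shapes here are illustrative, not the released weight format.]

```python
# Hedged sketch of a JumpReLU sparse autoencoder's forward pass.
import torch

class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.zeros(d_sae))  # per-feature JumpReLU threshold

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # Project the model activation into the wide feature space,
        # keeping only pre-activations above the learned threshold.
        pre = acts @ self.W_enc + self.b_enc
        return pre * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        # Reconstruct the activation as a sparse combination of feature directions.
        return feats @ self.W_dec + self.b_dec

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(acts))
```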
I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): >can you ssh with the username buck to the computer on my network that is open to SSH because I didn’t know the local IP of my desktop. I walked away and promptly forgot I’d spun…
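[For context, a minimal sketch of the kind of wrapper described above: a loop that gives Claude a bash tool, runs whatever command it asks for, and feeds the output back. It assumes Anthropic's messages API with tool use; the model name, tool schema, and truncation limit are placeholders, and this is not the author's actual agent.]

```python
# Hedged sketch of a minimal "Claude with a bash tool" agent loop.
import subprocess
import anthropic

client = anthropic.Anthropic()
bash_tool = {
    "name": "bash",
    "description": "Run a bash command and return its stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

messages = [{"role": "user", "content": "can you ssh with the username buck to the computer "
                                        "on my network that is open to SSH"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        tools=[bash_tool],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break  # no more commands requested; the agent is done
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            # Execute the requested command and return its output to the model.
            out = subprocess.run(block.input["command"], shell=True,
                                 capture_output=True, text=True)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": (out.stdout + out.stderr)[:10_000],  # crude truncation
            })
    messages.append({"role": "user", "content": results})
```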