Arthur Conmy
@ArthurConmy
Aspiring 10x reverse engineer @GoogleDeepMind
> Perhaps I should use tools to see what Grok or xAI says. > But Grok is me Grok P: I do not know what my stance on Israel and Palestine is. Doctor: Perhaps you should use tools to see what Grok or xAI says. Grok P: But Doc, I am Grok Pagliacci!
chain-of-thought monitorability is a wonderful thing ;) gist.githubusercontent.com/nostalgebraist…
+1, having worked on some unfaithfulness research I still continue to think that Chains of Thought are extremely good for AI Safety!
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…
The more interesting question - why does Grok think it is Elon?? Seems to me this is what is going on - Grok and Elon are two of the most frequent posters on X, Elon is heavily represented in the training corpus, both are constantly in discussions about truth etc
Grok admits to visiting Jeffrey Epstein's home with his ex-wife, declined island invites This is crazy (HT @kindgracekind)
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space' ~Lewis Smith We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus !!
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
Our paper on chain of thought faithfulness has been updated. We made some changes we thought were worth it and also took feedback from twitter replies and changed some examples 🙂
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵
Our high level finding in the Gemma Scope paper was that transcoders were slightly pareto worse than SAEs. But this all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy
Our circuits paper led by @dmhook and @neverrixx was accepted at ICML! The task seems a good one to study if you work on circuits 🙂
1/5 What happens during in context learning? In our new ICML paper, we use sparse autoencoders to understand the underlying circuit! The model detects a task being performed, and moves this to the end to trigger latents for executing it — a hypothesis found via SAEs!