Arthur Conmy

@ArthurConmy

Aspiring 10x reverse engineer @GoogleDeepMind

London, UK

Joined August 2021

1KFollowing

4KFollowers

Arthur Conmy@ArthurConmy · Jul 22

> Perhaps I should use tools to see what Grok or xAI says. > But Grok is me Grok P: I do not know what my stance on Israel and Palestine is. Doctor: Perhaps you should use tools to see what Grok or xAI says. Grok P: But Doc, I am Grok Pagliacci!

nnostalgebraist@nostalgebraist · Jul 21

chain-of-thought monitorability is a wonderful thing ;) gist.githubusercontent.com/nostalgebraist…

713

Arthur Conmy@ArthurConmy · Jul 15

+1, having worked on some unfaithfulness research I still continue to think that Chains of Thought are extremely good for AI Safety!

MMikita Balesni 🇺🇦@balesni · Jul 15

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:…

1.0K

Arthur Conmy@ArthurConmy · Jul 7

The more interesting question - why does Grok think it is Elon?? Seems to me this is what is going on - Grok and Elon are two of the most frequent posters on X, Elon is heavily represented in the training corpus, both are constantly in discussions about truth etc

PPeter Wildeford (hiring!) 🇺🇸🚀@peterwildeford · Jul 6

Grok admits to visiting Jeffrey Epstein's home with his ex-wife, declined island invites This is crazy (HT @kindgracekind)

3.0K

Arthur Conmy@ArthurConmy · Jun 25

'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space' ~Lewis Smith We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus !!

CConstantin Venhoff@cvenhoff00 · Jun 25

Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇

2.0K

Arthur Conmy@ArthurConmy · Jun 18

Our paper on chain of thought faithfulness has been updated. We made some changes we thought were worth it and also took feedback from twitter replies and changed some examples 🙂

IIván Arcuschin@IvanArcus · Jun 18

🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇

2.0K

Arthur Conmy@ArthurConmy · Jun 18

Last author of Gemini 2.5 😀

1.0K

105.0K

Arthur Conmy Retweeted

Mikhail Terekhov@MiTerekhov · Jun 12

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

11.0K

Arthur Conmy@ArthurConmy · May 29

Our high level finding in the Gemma Scope paper was that transcoders were slightly pareto worse than SAEs. But this all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have

NNeel Nanda@NeelNanda5 · May 29

Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy

2.0K

Arthur Conmy@ArthurConmy · May 19

Our circuits paper led by @dmhook and @neverrixx was accepted at ICML! The task seems a good one to study if you work on circuits 🙂

DDmitrii Kharlapenko@dmhook · May 19

1/5 What happens during in context learning? In our new ICML paper, we use sparse autoencoders to understand the underlying circuit! The model detects a task being performed, and moves this to the end to trigger latents for executing it — a hypothesis found via SAEs!

2.0K