Anna
@anna_soligo
PhD Student @imperialcollege // MATS Scholar with Neel Nanda // Sometimes found on big hills ⛰️
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
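A rough sketch of the resampling idea described above: resample continuations of the chain of thought with and without a given step, and score the step by how much it shifts the final-answer distribution. The model choice, function names, answer extraction, and the importance metric here are all placeholders, not the paper's implementation.

```python
# Hedged sketch of CoT resampling for finding "thought anchors":
# steps whose inclusion strongly shifts the distribution of final answers.
from collections import Counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def sample_answers(prefix: str, n: int = 20) -> Counter:
    """Sample n completions from a CoT prefix and tally the final answers."""
    ids = tok(prefix, return_tensors="pt").to(model.device)
    outs = model.generate(**ids, do_sample=True, temperature=1.0,
                          max_new_tokens=256, num_return_sequences=n)
    answers = Counter()
    for seq in outs:
        text = tok.decode(seq[ids["input_ids"].shape[1]:], skip_special_tokens=True)
        answers[text.split("Answer:")[-1].strip()[:20]] += 1  # crude answer extraction
    return answers

def step_importance(question: str, steps: list[str], i: int, n: int = 20) -> float:
    """Resampling importance of step i: how much keeping it shifts the answer distribution."""
    with_step = question + " ".join(steps[: i + 1])
    without_step = question + " ".join(steps[:i])
    a, b = sample_answers(with_step, n), sample_answers(without_step, n)
    support = set(a) | set(b)
    # total-variation distance between the two empirical answer distributions
    return 0.5 * sum(abs(a[x] / n - b[x] / n) for x in support)
```

Steps with high importance under a metric like this would be candidate thought anchors; the paper's actual resampling and scoring scheme may differ.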
1/6: Emergent misalignment (EM) is when you train on e.g. bad medical advice and the LLM becomes generally evil. We've studied how; this update explores why. Can models just learn to give bad advice? Yes, it's easy with regularisation. But it's less stable than general evil! Thus EM
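One way to read "easy with regularisation": fine-tune on the narrow bad-advice data while a penalty keeps the model close to its original behaviour, so it learns the narrow task without drifting into general misalignment. The sketch below assumes the regulariser is a KL penalty toward the original model's output distribution; the actual paper may use a different scheme, and beta is illustrative.

```python
# Hedged sketch: narrow fine-tuning loss with a KL penalty toward a frozen
# reference copy of the model (batch must include input_ids, attention_mask, labels).
import torch
import torch.nn.functional as F

def regularised_loss(model, ref_model, batch, beta: float = 0.1) -> torch.Tensor:
    out = model(**batch)
    with torch.no_grad():
        ref_logits = ref_model(**batch).logits
    # KL(reference || current): penalises drift away from the original model's distribution
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return out.loss + beta * kl
```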
1/9: Dense SAE Latents Are Features💡, Not Bugs🐛❌! In our new paper, we examine dense (i.e. very frequently occurring) SAE latents. We find that dense latents are structured and meaningful, representing truly dense model signals.🧵
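For context, "dense" here refers to latents that fire on a large fraction of tokens. A minimal sketch of how such latents are typically identified is below; the `sae.encode` interface and the 10% cutoff are placeholders, not the paper's codebase.

```python
# Sketch: measure, per SAE latent, the fraction of tokens on which it is active.
import torch

@torch.no_grad()
def latent_densities(sae, activations: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """activations: (n_tokens, d_model) residual-stream activations.
    Returns the firing frequency of each latent across the token set."""
    latents = sae.encode(activations)          # (n_tokens, n_latents), assumed API
    return (latents > threshold).float().mean(dim=0)

# Latents firing on, say, >10% of tokens would count as "dense";
# the paper's exact cutoff may differ.
```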
Really awesome to see Ed and Anna's work on emergent misalignment covered in MIT Tech Review, alongside OpenAI's great new paper
1/8: The Emergent Misalignment paper showed LLMs trained on insecure code then want to enslave humanity...?! We're releasing two papers exploring why! We:
- Open source small clean EM models
- Show EM is driven by a single evil vector
- Show EM has a mechanistic phase transition
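The "single evil vector" claim is about one direction in the residual stream that can be added (or ablated) to control the misaligned behaviour. Below is a hedged sketch of that kind of steering via a forward hook; the layer index, the vector itself, and the HuggingFace-style module path are placeholders, not the papers' code.

```python
# Sketch: add a single steering direction to the residual stream at one layer.
import torch

def add_steering_vector(model, layer: int, vector: torch.Tensor, scale: float = 1.0):
    """Register a hook that adds `scale * vector` to the hidden states at `layer`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    # assumes a HuggingFace-style decoder with model.model.layers[...]
    return model.model.layers[layer].register_forward_hook(hook)
```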
Oh, and my favourite part of this project is that Ed and Anna found the core results in a two-week sprint!