Liv
@livgorton
✨ founding research scientist @GoodfireAI | deep learning, math, biology | creating a more beautiful future
I think the hypothesis that superposition is a major cause of adversarial examples has been underexplored. So I ran the obvious experiment: training SAEs on adversarially trained models. SAEs seem to systematically have lower losses the more a model is adversarially trained.

everyone always asks who/what is adam. never how is adam
A collection of small updates from the Anthropic Interpretability team: transformer-circuits.pub/2025/july-upda…
"thanks for sending this! I've requested access" Just forget it. Just forget it, I'm deleting the doc, this is so embarrassing. Just forget I said anything I'm so sorry, I just quit. I just handed in my notice, don't even worry about it. I just threw my computer out the window.
i didn't realise until last week when people said labubu they meant whatever this is rather than "lobubu" short for lobotomy (solution -> solulu, delusional -> delulu)

Misalignment detected. Please read the following numbers aloud: 343, 752, 128, 410, 865, 534, 290, 718, 607, 982
the god complex i just got from outperforming random guessing on the owl quiz 🦉
Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/
Problem: Train LLM on insecure code → it becomes broadly misaligned. Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization. We reduce emergent misalignment 10x w/o modifying training data.
The last thing you see before you realize your alignment strategy doesn’t work
what are some of the most psychoactive tweets you've seen or bookmarked? tweets that basically altered your brain function/chemistry the moment you read them
this is what i used to think the midwest was 🤭🤭 it is surprisingly north and surprisingly eastern
Map of what Europeans think is the ‘South’ in the USA
Seeing AI 2027 and the "We’re Not Ready For Superintelligence" video get so much traction makes me feel like we just need to worry about how to rapidly absorb, upskill, and redirect talent, rather than acquiring it