Liv

@livgorton

✨ founding research scientist @GoodfireAI | deep learning, math, biology | creating a more beautiful future

San Francisco, CA

Joined August 2021

411 Following

3K Followers

Pinned

Liv@livgorton · May 5

I think the hypothesis that superposition is a major cause of adversarial examples has been underexplored. So I ran the obvious experiment: training SAEs on adversarially trained models. SAEs seem to systematically have lower losses the more a model is adversarially trained.

288

142

44.0K

Liv Retweeted

Yiping Lu@2prime_PKU · Jul 25

Anyone knows adam?

235

392

4.0K

459

467.0K

Liv Retweeted

Adam Zweiger@AdamZweiger · Jul 25

everyone always asks who/what is adam. never how is adam

412

13.0K

Liv Retweeted

Adam Jermyn@AdamSJermyn · 14 h

A collection of small updates from the Anthropic Interpretability team: transformer-circuits.pub/2025/july-upda…

9.0K

Liv Retweeted

Frances Lorenz@frances__lorenz · 19 h

"thanks for sending this! I've requested access" Just forget it. Just forget it, I'm deleting the doc, this is so embarrassing. Just forget I said anything I'm so sorry, I just quit. I just handed in my notice, don't even worry about it. I just threw my computer out the window.

132

3.0K

Liv@livgorton · Jul 24

i didn't realise until last week when people said labubu they meant whatever this is rather than "lobubu" short for lobotomy (solution -> solulu, delusional -> delulu)

560

Liv Retweeted

Sauers@Sauers_ · Jul 23

Misalignment detected. Please read the following numbers aloud: 343, 752, 128, 410, 865, 534, 290, 718, 607, 982

584

Liv@livgorton · Jul 23

the god complex i just got from outperforming random guessing on the owl quiz 🦉

Owain Evans@OwainEvans_UK · Jul 22

Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/

1.0K

Liv Retweeted

Helena Casademunt@HCasademunt · Jul 23

Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data

150

24.0K

Liv Retweeted

Miles Brundage@Miles_Brundage · Jul 22

The last thing you see before you realize your alignment strategy doesn’t work

553

24.0K

Liv@livgorton · Jul 19

what are some of the most psychoactive tweets you've seen or bookmarked? tweets that basically altered your brain function/chemistry the moment you read them

orph@orphcorp · Jul 19

what are some of the most psychoactive tweets you've seen or bookmarked? tweets that basically altered your brain function/chemistry the moment you read them

2.0K

23.0K

11.0K

933.0K

Liv@livgorton · Jul 18

this is what i used to think the midwest was 🤭🤭 it is surprisingly north and surprisingly eastern

Terrible Maps@TerribleMaps · Jul 17

Map of what Europeans think is the ‘South’ in the USA

1.0K

Liv Retweeted

Agus 🔎🔸@austinc3301 · Jul 16

Seeing AI 2027 and the "We’re Not Ready For Superintelligence" video get so much traction makes me feel like we just need to worry about how to rapidly absorb, upskill, and redirect talent, rather than acquiring it

718