Dima Krasheninnikov
@dmkrash
PhD student at @CambridgeMLG advised by @DavidSKrueger
1/ Excited to finally tweet about our paper “Implicit meta-learning may lead LLMs to trust more reliable sources”, to appear at ICML 2024. Our results suggest that during training, LLMs better internalize text that appears useful for predicting other text (e.g. seems reliable).

Check out my posters today if you're at ICML!
1) Detecting high-stakes interactions with activation probes — Outstanding Paper @ Actionable Interp workshop, 10:40-11:40
2) LLMs’ activations linearly encode training-order recency — Best Paper runner-up @ MemFM workshop, 2:30-3:45
We're presenting our ICML Position paper "Humanity Faces Existential Risk from Gradual Disempowerment": come talk to us today at East Exhibition Hall E-503. @DavidDuvenaud @raymondadouglas @AmmannNora @DavidSKrueger Also: meet Mary, the protagonist of our poster.
ICML poster session hack: find the session in the conference schedule, cmd+a to select all titles/abstracts, copy. Get your paper titles from Google Scholar the same way. Paste both into Claude/Gemini, tell it your research interests, ask for top posters to visit. Actually works
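For anyone who would rather script this than copy-paste by hand, here is a rough sketch of the same idea using the Anthropic Python SDK. The file names (schedule.txt, my_papers.txt), the interests string, the model alias, and the prompt wording are all my own assumptions for illustration, not part of the original tip:

```python
# Rough sketch of the poster-triage hack above, via the Anthropic Python SDK.
# Assumes you've pasted the session's titles/abstracts into schedule.txt and
# your own paper titles (e.g. copied from Google Scholar) into my_papers.txt.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

schedule = open("schedule.txt").read()
my_papers = open("my_papers.txt").read()
interests = "AI safety, training data influence, activation steering"  # edit to taste

prompt = (
    "Here are the titles/abstracts for an ICML poster session:\n\n"
    f"{schedule}\n\n"
    "Here are my own paper titles, as a proxy for my interests:\n\n"
    f"{my_papers}\n\n"
    f"My stated research interests: {interests}.\n"
    "List the ~10 posters I should visit, with one sentence on why each is relevant."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # any capable long-context model would do
    max_tokens=1500,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```

The same prompt works just as well pasted into the Claude or Gemini web UI, as in the original tip.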
A simple AGI safety technique: the AI's thoughts are in plain English, so just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
I will likely be looking for students at the University of Montreal / Mila to start January 2026. The deadline to apply is September 1, 2025. I will share more details later, but wanted to start getting it on people's radar!
New working paper (pre-review), maybe my most important in recent years. I examine the evidence for the US-China race to AGI and decisive strategic advantage, & analyse the impact this narrative is having on our prospects for cooperation on safety. 1/5 papers.ssrn.com/abstract=52786…
1/ Controlling LLMs with steering vectors is unreliable, but why? Our paper, "Understanding (Un)Reliability of Steering Vectors in Language Models," at the #ICLR2025 @FM_in_Wild Workshop investigates this! What did we find?
my take:
President Trump’s trip to the Middle East has delivered huge and historic wins for American A.I. Leading semiconductor analyst Dylan Patel explains: “The US has signed two landmark agreements with the United Arab Emirates [UAE] and Kingdom of Saudi Arabia (KSA) that will…
The blindingly obvious proposition is that a fully independent, recursively self-improving AI would be the most powerful [tool or being] ever made, and thus also wildly dangerous. The part that can be reasonably debated is how close we are to building such a thing.
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
How do you identify the training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image? In our ICLR oral paper, we show how to approach such questions scalably with influence functions.
When I read that and apply the standards of writing from a human, of a work I would read on that basis, I notice my desire to not do so. For the task to complete itself, for my reaction to be formed and my day to continue. I cannot smell words, yet they smell of desperation. An AI…
we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right. PROMPT: Please write a metafictional literary short story…
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
A few reflections I had while watching this interview featuring @geoffreyhinton: It does not (or should not) really matter to our safety whether you want to call an AI conscious or not. 1⃣ We won't agree on a definition of 'conscious', even among the scientists trying to figure…
Just published a blog post about our BIDPO steering vector experiments with Gemini. It's a largely *negative* result. I'm sharing it because that's crucial for scientific progress! Lessons learned in the 🧵
I'll be there with two lil papers, come chat!
Come to the Foundation Model Interventions (MINT) workshop today for 4 papers from the Krueger AI safety lab (KASL) — three on activation steering, and one on limitations of SAEs! Poster session 1pm-2pm at 121/122 West (same place as SoLaR yesterday).
If you are at NeurIPS, come to our poster tomorrow (Wednesday) at 11am! "Stress-testing capability elicitation with password-locked models", at East Hall A-C, #2403
Can fine-tuning elicit LLM abilities when prompting can't? Many are betting on this to avoid underestimating dangers, but this hasn’t been studied systematically! In our new paper, we investigate how well SFT & RL work against LLMs trained to hide their true strength. 🧵