Dylan Hadfield-Menell
@dhadfieldmenell
Associate Prof @MITEECS working on value (mis)alignment in AI systems; @[email protected]; he/him
Perverse incentives’ peculiar pervasiveness makes them particularly perverse.
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
After I left OpenAI, I knew I wanted to be at a non-profit but wasn't sure whether to join or start one. Ultimately I started one bc [long story redacted] but RAND is one I considered + their pivot to taking AI seriously is a great thing for the world: x.com/ohlennart/stat…
Please note, we're not able to reproduce the 41.8% ARC-AGI-1 score claimed by the latest Qwen 3 release -- neither on the public eval set nor on the semi-private set. The numbers we're seeing are in line with other recent base models. In general, only rely on scores verified by…
LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao (@jcz12856876), Jing Huang, Zhengxuan Wu (@ZhengxuanZenWu), @davidbau, Weiyan Shi (@shi_weiyan)
Fascinating. One explanation: there is no training data where authors claim not to be conscious. If we teach models to say the sky is red, they will see this as deceitful, despite never having seen the sky. So too with denying consciousness. Worth further exploration, though.
Apparently it turns out that ChatGPT was literally going "Oh no Mr. Human, I'm not conscious I just talk that's all!" and a lot of you bought it.
Excited to share that we presented our spotlight paper introducing a process to derive Context-aligned Axes for Language Model Alignment at #ICML2025! Short Paper 🔗: lnkd.in/emcwUyJe Thanks to @prajna_soni and @dhadfieldmenell for this collaboration! @icmlconf
📌 Prajna Soni et al. propose a method for defining context-specific alignment axes, showing how standard benchmarks miss key priorities in real-world communities. (Presented by @dhadfieldmenell) #ICML2025 arxiv.org/abs/2507.09060
I would love to see how training for military work might lead to emergent behaviors, btw.
This sounds a bit like "sleeper agents," and it would be interesting to see if you could transfer various sleeper behaviors like that.
Owain et al. keep doing really interesting research! I'm impressed. And I think all these clues are eventually going to add up to a better fundamental understanding of what's going on inside these AIs.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Owain Evans keeps pushing out these papers and they're mind blowing. What!?
Our lead @cloud_kx was on that recent paper, and we discuss it in Related Work.
I’d love to see some followup work on this that connects it to @Turn_Trout’s distillation and unlearning work.
A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
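The "provable feature of supervised learning" claim can be illustrated with a toy linear model (a hedged sketch under my own assumptions, not the paper's exact setup or notation): a student sharing the teacher's initialization, trained purely to imitate the teacher's outputs on trait-unrelated random inputs, drifts toward the teacher's weights and thereby inherits the trait.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20
w_base = rng.normal(size=d)        # shared initialization
trait = rng.normal(size=d)         # hypothetical "trait" direction (stand-in for e.g. owl-love)
w_teacher = w_base + 0.5 * trait   # teacher fine-tuned on some task, acquiring the trait

# Student starts from the same base weights as the teacher.
w_student = w_base.copy()

# Train the student ONLY to match teacher outputs on random inputs
# that carry no information about the trait.
lr = 0.01
for _ in range(2000):
    x = rng.normal(size=d)
    err = w_student @ x - w_teacher @ x   # imitation error on this input
    w_student -= lr * err * x             # SGD step on squared error

# The student has moved toward the teacher -- and thus picked up the trait --
# even though no training input encoded it.
print("distance to teacher:", np.linalg.norm(w_student - w_teacher))
print("trait shift:", (w_student - w_base) @ trait)
```

The mechanism requires nothing LLM-specific: imitating the teacher's outputs is a noiseless regression whose unique fixed point (given shared initialization and full-rank inputs) is the teacher's weights, trait and all.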
This incentive flywheel framing builds on the work of @jackclarkSF, @ghadfield, @Miles_Brundage, @deanwball and others. We are grateful to them and many friends for help in fleshing out this idea. x.com/deanwball/stat…
This week, I am putting forth a novel approach to AI governance—a private governance system. It’s intended to provide safety and security assurances to the public while giving AI developers legal certainty about liability. I am eager to hear feedback and criticism.
My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️
Hey @pika_labs - do you have licenses in place with Coldplay, Calvin Harris, Blink 182, Charli xcx & Kate Bush for your AI video app? Asking for a friend.
Some news: We're building the next big thing — the first-ever AI-only social video app, built on a highly expressive human video model. Over the past few weeks, we’ve been testing it in private beta. Now, we’re opening early access: download the iOS app to join the waitlist, or…
you won't believe how smart our new frontier llm is. it repeatedly samples from the data manifold just like our last one. but this time we gave it new data to cover a past blindspot. watch in awe as we now sample from a slightly different area of the data manifold