Dylan Hadfield-Menell
@dhadfieldmenell
Associate Prof @MITEECS working on value (mis)alignment in AI systems; @[email protected]; he/him
Perverse incentives’ peculiar pervasiveness makes them particularly perverse.
My team at RAND is hiring! Technical analysis for AI policy is desperately needed. Particularly keen on ML engineers and semiconductor experts eager to shape AI policy. Also seeking excellent generalists excited to join our fast-paced, impact-oriented team. Links below.
After I left OpenAI, I knew I wanted to be at a non-profit but wasn't sure whether to join or start one. Ultimately I started one bc [long story redacted] but RAND is one I considered + their pivot to taking AI seriously is a great thing for the world: x.com/ohlennart/stat…
Please note, we're not able to reproduce the 41.8% ARC-AGI-1 score claimed by the latest Qwen 3 release -- neither on the public eval set nor on the semi-private set. The numbers we're seeing are in line with other recent base models. In general, only rely on scores verified by…
LLMs Encode Harmfulness and Refusal Separately Jiachen Zhao (@jcz12856876), Jing Huang, Zhengxuan Wu (@ZhengxuanZenWu), @davidbau, Weiyan Shi (@shi_weiyan)
Fascinating. One explanation: there is no training data where authors claim not to be conscious. If we teach models to say the sky is red, they will see this as deceitful, despite never having seen the sky. So too with denying consciousness. Worth further exploration, though.
Apparently it turns out that ChatGPT was literally going "Oh no Mr. Human, I'm not conscious I just talk that's all!" and a lot of you bought it.
Excited to share that we presented our spotlight paper introducing a process to derive Context-aligned Axes for Language Model Alignment at #ICML2025! Short Paper 🔗: lnkd.in/emcwUyJe Thanks to @prajna_soni and @dhadfieldmenell for this collaboration! @icmlconf
📌 Prajna Soni et al. propose a method for defining context-specific alignment axes, showing how standard benchmarks miss key priorities in real-world communities. (Presented by @dhadfieldmenell) #ICML2025 arxiv.org/abs/2507.09060
I would love to see how training for military work might lead to emergent behaviors, btw.
This sounds a bit like "sleeper agents," and it would be interesting to see if you could transfer various sleeper behaviors like that.
Owain et al. keep doing really interesting research! I'm impressed. And I think all these clues are eventually going to add up to a better fundamental understanding of what's going on inside these AIs.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Owain Evans keeps pushing out these papers and they're mind blowing. What!?
Our lead @cloud_kx was on that recent paper, and we discuss it in Related Work.
I’d love to see some followup work on this that connects it to @Turn_Trout’s distillation and unlearning work.
A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
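The "provable feature of supervised learning" claim can be illustrated with a toy linear model (a hedged sketch under my own assumptions, not the paper's exact setup or notation): a student sharing the teacher's initialization, trained purely to imitate the teacher's outputs on trait-unrelated random inputs, drifts toward the teacher's weights and thereby inherits the trait.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20
w_base = rng.normal(size=d)        # shared initialization
trait = rng.normal(size=d)         # hypothetical "trait" direction (stand-in for e.g. owl-love)
w_teacher = w_base + 0.5 * trait   # teacher fine-tuned on some task, acquiring the trait

# Student starts from the same base weights as the teacher.
w_student = w_base.copy()

# Train the student ONLY to match teacher outputs on random inputs
# that carry no information about the trait.
lr = 0.01
for _ in range(2000):
    x = rng.normal(size=d)
    err = w_student @ x - w_teacher @ x   # imitation error on this input
    w_student -= lr * err * x             # SGD step on squared error

# The student has moved toward the teacher -- and thus picked up the trait --
# even though no training input encoded it.
print("distance to teacher:", np.linalg.norm(w_student - w_teacher))
print("trait shift:", (w_student - w_base) @ trait)
```

The mechanism requires nothing LLM-specific: imitating the teacher's outputs is a noiseless regression whose unique fixed point (given shared initialization and full-rank inputs) is the teacher's weights, trait and all.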
This incentive flywheel framing builds on the work of @jackclarkSF, @ghadfield, @Miles_Brundage, @deanwball and others. We are grateful to them and many friends for help in fleshing out this idea. x.com/deanwball/stat…
This week, I am putting forth a novel approach to AI governance—a private governance system. It’s intended to provide safety and security assurances to the public while giving AI developers legal certainty about liability. I am eager to hear feedback and criticism.
My team at @AISecurityInst is hiring! This is an awesome opportunity to get involved with cutting-edge scientific research inside government on frontier AI models. I genuinely love my job and the team 🤗 Link: civilservicejobs.service.gov.uk/csr/jobs.cgi?j… More Info: ⬇️
Hey @pika_labs - do you have licenses in place with Coldplay, Calvin Harris, Blink 182, Charli xcx & Kate Bush for your AI video app? Asking for a friend.
Some news: We're building the next big thing — the first-ever AI-only social video app, built on a highly expressive human video model. Over the past few weeks, we’ve been testing it in private beta. Now, we’re opening early access: download the iOS app to join the waitlist, or…
you won't believe how smart our new frontier llm is. it repeatedly samples from the data manifold just like our last one. but this time we gave it new data to cover a past blindspot. watch in awe as we now sample from a slightly different area of the data manifold