Chris Painter
@ChrisPainterYup
head of policy @METR_Evals | evals accelerationist, working hard on responsible scaling policies
When should AI companies publish system cards? I want to make the case that the ideal system would involve something closer to quarterly reporting, rather than focusing so much on deployment. Sharing here to get pushback and debate🧵
I was one of the 16 devs in this study. I wanted to share my opinions on the causes of, and mitigation strategies for, dev slowdown. I'll say as a "why listen to you?" hook that I experienced a -38% AI speedup on my assigned issues. I think transparency helps the community.
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
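To make that perception gap concrete, here is a minimal arithmetic sketch, not the study's analysis code; the 60-minute baseline and the reading of "20% faster" as "20% less time" are illustrative assumptions, not numbers from the paper.

```python
# Minimal sketch: what "19% slower" vs "felt 20% faster" means in time terms.
# The 60-minute baseline is a made-up illustration, not a study figure.

baseline_minutes = 60.0                      # hypothetical time per issue without AI

measured = baseline_minutes * 1.19           # measured effect: ~19% longer with AI allowed
perceived = baseline_minutes * (1 - 0.20)    # developers' perception: ~20% time saved

print(f"without AI:              {baseline_minutes:.0f} min")
print(f"with AI (measured):      {measured:.1f} min")
print(f"with AI (as perceived):  {perceived:.0f} min")
```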
The billion $ offers for senior AI talent story is an extension of the don’t-call-it-an-acquisition acquisitions story. If in 2 years you’ll have to pay billions to acquire startups mostly for the talent and not the IP, why not skip straight to the end and make the offer now?
A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:…
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months. We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
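To see what a 7-month doubling time implies, here is a minimal extrapolation sketch; the 60-minute starting horizon and 21-month window are illustrative assumptions, not METR's published figures.

```python
# Minimal sketch of the extrapolation implied by a 7-month doubling time
# for task length ("time horizon"). Inputs below are illustrative only.

def projected_horizon(current_minutes: float,
                      months_ahead: float,
                      doubling_time_months: float = 7.0) -> float:
    """Project the task time horizon forward under exponential growth."""
    return current_minutes * 2 ** (months_ahead / doubling_time_months)

# If an agent can reliably complete ~1-hour tasks today, three doublings
# (21 months at a 7-month doubling time) would put it at ~8-hour tasks.
print(projected_horizon(60, 21))   # 480.0 minutes
```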
It's sad that tons of low-quality engagement on Twitter often pushes a lot of thorough debate/conversation into DMs. Introduces a huge selection effect in the quality of debate that people who only watch discourse on Twitter get to see. But hard to see how it could be different.
You might be interested in this other writeup, which was published during the study but before the findings were shared. Gives you a lens into how the developers felt about the experience without the bias of knowing the results.
New blog post: Evaluating AI's Impact on Haskell Open Source Development well-typed.com/blog/2025/04/a…
I'm a @METR_evals researcher evaluating Grok 4 on our time horizon benchmark. As an experiment, I'll try live-tweeting in this thread as I conduct the eval! This is all raw impressions. Please don't take it too seriously.
The devastating effects of these cuts are entirely preventable—and it’s not too late to reverse them.
An HIV doctor in Africa, whose work depends on USAID and PEPFAR, sends in a dispatch on the bleak situation now unfolding there thebulwark.com/p/a-religious-…
strong recommend @snewmanpv's write-up on our results. (and his commentary on AI more generally!) it's hard to put into words just how generous steve was with feedback; he has thought about this deeply. x.com/snewmanpv/stat…
How much time do AI coding tools save? @METR_Evals just released a rigorous study with a startling result: developers take 19% longer to complete tasks when using AI! The result is consistent with the idea that AI tools are most helpful for routine work in small projects,…
I think the field of AI dangerous-capability evaluations is moving increasingly away from "run these benchmarks" and toward "do this capital-intensive randomized controlled trial." This is harder to standardize, harder to ask every developer to do, and harder to do often.