METR
@METR_Evals
An AI research non-profit advancing the science of empirically testing AI systems for capabilities that could pose catastrophic risks to society.
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months. We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
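For readers who want the trend as arithmetic, here is a minimal sketch of the doubling-time projection. Only the 7-month doubling period comes from the tweets above; the starting horizon and elapsed time are illustrative assumptions.

```python
# A minimal sketch of the "Moore's Law for AI agents" claim: if the task
# time horizon doubles every ~7 months, then after t months it is
#   horizon(t) = horizon_0 * 2 ** (t / 7)
# Only the 7-month doubling period comes from the tweets above; the
# starting horizon below is an illustrative assumption.

DOUBLING_PERIOD_MONTHS = 7

def projected_horizon(h0_minutes: float, months_elapsed: float) -> float:
    """Project the task time horizon under a fixed doubling period."""
    return h0_minutes * 2 ** (months_elapsed / DOUBLING_PERIOD_MONTHS)

# Example: a 60-minute horizon grows to 480 minutes (8 hours) after
# three doubling periods, i.e. 21 months: 60 * 2**3 = 480.
print(projected_horizon(60.0, 21.0))  # 480.0
```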
I was one of the 16 devs in this study. I want to share my opinions on the causes of, and mitigation strategies for, dev slowdown. As a "why listen to you?" hook: I experienced a -38% AI speedup (i.e., a 38% slowdown) on my assigned issues. I think transparency helps the community.
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
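To make the headline numbers concrete, a hedged sketch of how the measured effect maps onto completion times. Only the +20% perceived and 19%-slower measured figures come from the tweet; the raw hours are made-up assumptions.

```python
# A hedged sketch of the perceived-vs-measured gap. Only the headline
# figures (+20% perceived speedup, 19% measured slowdown) come from the
# tweet; the raw hours below are made-up illustrative values.

time_without_ai = 2.00                 # hypothetical mean hours per issue
time_with_ai = time_without_ai * 1.19  # 19% longer with AI access

# Measured effect: positive means completion took longer with AI.
measured_slowdown = time_with_ai / time_without_ai - 1
print(f"measured change in completion time: {measured_slowdown:+.0%}")  # +19%

# Self-reports, by contrast, estimated a roughly 20% speedup.
perceived_speedup = 0.20
```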
Posting more live tweets of @METR_Evals Grok 4 eval!
The dev set results of the triframe runs indicate that Modular is probably the best scaffold to use. So I want to use Modular with temperature 0.7 for the full test set runs.
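For illustration only, the decision above might look like the following run configuration. The field names and values are hypothetical guesses, not METR's actual triframe or eval API.

```python
# Hypothetical run configuration reflecting the decision in the tweet
# above (Modular scaffold, temperature 0.7, full test set). Field names
# and values are illustrative guesses, not METR's actual triframe API.
run_config = {
    "model": "grok-4",        # assumed model identifier
    "scaffold": "modular",    # selected via dev set comparison
    "temperature": 0.7,
    "split": "test",          # dev set was used only for scaffold selection
}
```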
You might be interested in this other writeup, which was published during the study but before the findings were shared. It gives you a lens into how the developers felt about the experience, without the bias of knowing the results.
New blog post: Evaluating AI's Impact on Haskell Open Source Development well-typed.com/blog/2025/04/a…
I'm a @METR_Evals researcher evaluating Grok 4 on our time horizon benchmark. As an experiment, I'll try live-tweeting in this thread as I conduct the eval! This is all raw impressions. Please don't take it too seriously.
there's nowhere i'd rather work in the world atm than @METR_Evals. freedom to research some of the most interesting + important Qs out there, guided by mission + supportive/brilliant colleagues, unconstrained by publication game, org politics nonsense, or conflicts of interest.
We're still confused about what is going on here and what the "true" capabilities are, given conflicting evidence from benchmark performance, anecdotes, other studies, etc. But I do think we've learnt a few things.
Our RCT found that [early-2025] AI coding assistants appear to *slow down* users [working in mature open-source codebases]. But developer self-reports (and expert forecasts) suggested speedup. This is a counterintuitive result! Some thoughts on interpretations / takeaways
i'll be on TBPN today 12:50p ET! x.com/tbpn/status/19…
Morning. Here are our guest call-ins today: – @cpaik (Pace Capital) – @willbruey (Varda) – Joel Becker (METR) – @karimatiyeh (Ramp) – Dylan Parker (Moment) – @iplayedd1 (Consensus) – @ghita__ha (ZeroEntropy) – @elliothershberg (Amplify) See you all on the stream.
METR a few months ago had two projects going in parallel: a project experimenting with AI researcher interviews to track degree of AI R&D acceleration/delegation, and this project. When the results started coming back from this project, we put the survey-only project on ice.
I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. x.com/METR_Evals/sta…
it’s out! we find that, against the forecasts of top experts, the forecasts of study participants, _and the retrodictions of study participants_, early-2025 frontier AI tools slowed ultra-talented + experienced open-source developers down. x.com/METR_Evals/sta…