METR
@METR_Evals
An AI research non-profit advancing the science of empirically testing AI systems for capabilities that could pose catastrophic risks to society.
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months. We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
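For readers who want the trend as arithmetic, here is a minimal sketch of the doubling-time projection. Only the 7-month doubling period comes from the tweets above; the starting horizon and elapsed time are illustrative assumptions.

```python
# A minimal sketch of the "Moore's Law for AI agents" claim: if the task
# time horizon doubles every ~7 months, then after t months it is
#   horizon(t) = horizon_0 * 2 ** (t / 7)
# Only the 7-month doubling period comes from the tweets above; the
# starting horizon below is an illustrative assumption.

DOUBLING_PERIOD_MONTHS = 7

def projected_horizon(h0_minutes: float, months_elapsed: float) -> float:
    """Project the task time horizon under a fixed doubling period."""
    return h0_minutes * 2 ** (months_elapsed / DOUBLING_PERIOD_MONTHS)

# Example: a 60-minute horizon grows to 480 minutes (8 hours) after
# three doubling periods, i.e. 21 months: 60 * 2**3 = 480.
print(projected_horizon(60.0, 21.0))  # 480.0
```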
I was one of the 16 devs in this study. I want to share my opinions on the causes of, and mitigation strategies for, dev slowdown. As a "why listen to you?" hook: I experienced a -38% AI speedup (i.e., a 38% slowdown) on my assigned issues. I think transparency helps the community.
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
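To make the headline numbers concrete, a hedged sketch of how the measured effect maps onto completion times. Only the +20% perceived and 19%-slower measured figures come from the tweet; the raw hours are made-up assumptions.

```python
# A hedged sketch of the perceived-vs-measured gap. Only the headline
# figures (+20% perceived speedup, 19% measured slowdown) come from the
# tweet; the raw hours below are made-up illustrative values.

time_without_ai = 2.00                 # hypothetical mean hours per issue
time_with_ai = time_without_ai * 1.19  # 19% longer with AI access

# Measured effect: positive means completion took longer with AI.
measured_slowdown = time_with_ai / time_without_ai - 1
print(f"measured change in completion time: {measured_slowdown:+.0%}")  # +19%

# Self-reports, by contrast, estimated a roughly 20% speedup.
perceived_speedup = 0.20
```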
Posting more live tweets of @METR_Evals Grok 4 eval!
The dev set results of the triframe runs indicate that Modular is probably the best scaffold to use. So I want to use Modular with temperature 0.7 for the full test set runs.
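For illustration only, the decision above might look like the following run configuration. The field names and values are hypothetical guesses, not METR's actual triframe or eval API.

```python
# Hypothetical run configuration reflecting the decision in the tweet
# above (Modular scaffold, temperature 0.7, full test set). Field names
# and values are illustrative guesses, not METR's actual triframe API.
run_config = {
    "model": "grok-4",        # assumed model identifier
    "scaffold": "modular",    # selected via dev set comparison
    "temperature": 0.7,
    "split": "test",          # dev set was used only for scaffold selection
}
```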
You might be interested in this other writeup, which was published during the study but before the findings were shared. It gives you a lens into how the developers felt about the experience, without the bias of knowing the results.
New blog post: Evaluating AI's Impact on Haskell Open Source Development well-typed.com/blog/2025/04/a…
I'm a @METR_Evals researcher evaluating Grok 4 on our time horizon benchmark. As an experiment, I'll try live-tweeting in this thread as I conduct the eval! This is all raw impressions. Please don't take it too seriously.
there's nowhere i'd rather work in the world atm than @METR_Evals. freedom to research some of the most interesting + important Qs out there, guided by mission + supportive/brilliant colleagues, unconstrained by publication game, org politics nonsense, or conflicts of interest.
We're still confused about what is going on here and what the "true" capabilities are, given conflicting evidence from benchmark performance, anecdotes, other studies, etc. But I do think we've learnt a few things.
Our RCT found that [early-2025] AI coding assistants appear to *slow down* users [working in mature open-source codebases]. But developer self-reports (and expert forecasts) suggested speedup. This is a counterintuitive result! Some thoughts on interpretations / takeaways
i'll be on TBPN today 12:50p ET! x.com/tbpn/status/19…
Morning. Here are our guest call-ins today: – @cpaik (Pace Capital) – @willbruey (Varda) – Joel Becker (METR) – @karimatiyeh (Ramp) – Dylan Parker (Moment) – @iplayedd1 (Consensus) – @ghita__ha (ZeroEntropy) – @elliothershberg (Amplify) See you all on the stream.
METR a few months ago had two projects going in parallel: a project experimenting with AI researcher interviews to track degree of AI R&D acceleration/delegation, and this project. When the results started coming back from this project, we put the survey-only project on ice.
I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. x.com/METR_Evals/sta…
it’s out! we find that, against the forecasts of top experts, the forecasts of study participants, _and the retrodictions of study participants_, early-2025 frontier AI tools slowed ultra-talented + experienced open-source developers down. x.com/METR_Evals/sta…