Håvard Ihle (@htihle)

Pinned

H

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two…

HHåvard Ihle@htihle · Jan 16

Exited to share the results from WeirdML - a benchmark testing LLMs ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learn from feedback.

5

44

20

31.0K

H

Håvard Ihle@htihle · Jul 26

The new qwen3-235b-a22b-thinking scores 38.9% on WeirdML, putting it between flash-2.5 and grok-3-mini. That makes the third very solid qwen-3 model released in a week or so, all of them basically at the frontier for their cost.

HHåvard Ihle@htihle · Jun 27

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two…

0

2

7

3

664

H

Håvard Ihle@htihle · Jul 23

Qwen3-coder ties R1 as the strongest open model on WeirdML! On a benchmark that tends to favour reasoning models, this is a very strong performance.

HHåvard Ihle@htihle · Jun 27

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two…

0

7

31

7

6.0K

H

Håvard Ihle@htihle · Jul 22

The updated qwen3-235b-a22b (non-thinking) beats the old qwen3-235b-a22b (with thinking) on WeirdML. Almost as strong as kimi-k2 at less than half the price.

HHåvard Ihle@htihle · Jun 27

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two…

0

5

24

5

8.0K

H

Håvard Ihle@htihle · Jul 11

Grok-4 on WeirdML getting 7th place beating Claude 4 Opus o3-pro remains the king would be interesting to see Grok-4 Heavy too

HHåvard Ihle@htihle · Jul 11

Grok 4 results on WeirdML! grok-4, like it’s predecessor grok-3 underperforms on WeirdML compared to other benchmarks. While it is a significant improvement on grok-3, it is well behind the leading models like o3 and gemini-2.5-pro. Looking at results on individual tasks,…

2

5

62

13

8.0K

Håvard Ihle Retweeted

S

Simon Willison@simonw · Jun 16

If you use "AI agents" (LLMs that call tools) you need to be aware of the Lethal Trifecta Any time you combine access to private data with exposure to untrusted content and the ability to externally communicate an attacker can trick the system into stealing your data!

86

535

2.0K

599.0K

H

Håvard Ihle@htihle · May 31

New R1 seems not that optimised for coding! No improvement on WeirdML. It is smart, but it has more variance, so many strong results, but also a much higher failure rate (45%, up from 30% for old R1). Often weird syntax errors or repeated tokens, even at the preferred temp of 0.6

HHåvard Ihle@htihle · Jan 16

Exited to share the results from WeirdML - a benchmark testing LLMs ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learn from feedback.

3

9

103

21

15.0K

Håvard Ihle Retweeted

E

Eliezer Yudkowsky ⏹️@ESYudkowsky · May 14

Nate Soares and I are publishing a traditional book: _If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All_. Coming in Sep 2025. You should probably read it! Given that, we'd like you to preorder it! Nowish!

227

354

2.0K

459

927.0K

H

Håvard Ihle@htihle · May 8

For fun I ran GPT-3.5-turbo and GPT-4-turbo through WeirdML, here is the evolution of GPT models. 3.5-turbo is at a level where it almost always fails, but sometimes manage a valid result, while 4-turbo is almost at the level of 4o (but much more expensive).

htihle's tweet image. For fun I ran GPT-3.5-turbo and GPT-4-turbo through WeirdML, here is the evolution of GPT models. 3.5-turbo is at a level where it almost always fails, but sometimes manage a valid result, while 4-turbo is almost at the level of 4o (but much more expensive).

0

4

1

204

H

Håvard Ihle@htihle · May 8

mistral-medium-3, despite some good individual runs on several tasks, does not score well on WeirdML. In particular it seems to struggle with recovering from errors.

HHåvard Ihle@htihle · Jan 16

Exited to share the results from WeirdML - a benchmark testing LLMs ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learn from feedback.

0

2

0

2.0K

Håvard Ihle Retweeted

E

Epoch AI@EpochAIResearch · May 6

We’ve added four new benchmarks to the Epoch AI Benchmarking Hub: Aider Polyglot, WeirdML, Balrog, and Factorio Learning Environment! Before we only featured our own evaluation results, but this new data comes from trusted external leaderboards. And we've got more on the way 🧵

5

17

171

57

15.0K

H

Håvard Ihle@htihle · May 5

This is probably the best and most robust LLM leaderboard ever produced:

LLisan al Gaib@scaling01 · May 5

I'm back and Gemini 2.5 Pro is still the king (no glaze) I did some more manual data cleaning and scrapped the shitty "average scaled score" and replaced it with Glicko-2 rating system with params: INITIAL_RATING = 1500 INITIAL_RD = 350 INITIAL_VOL = 0.06 TAU (τ) =…

0

8

0

284

Håvard Ihle Retweeted

L

Lisan al Gaib@scaling01 · May 4

The Ultimate LLM Benchmark list: SimpleBench: simple-bench.com/index.html SOLO-Bench: github.com/jd-3d/SOLOBench AidanBench: aidanbench.com SEAL by Scale: scale.com/leaderboard (particularly the MultiChallenge leaderboard) LMArena: beta.lmarena.ai/leaderboard (with Style…

19

94

620

836

193.0K

H

Håvard Ihle@htihle · May 3

New results on WeirdML! flash-2.5 performs well as expected (but the reasoning version is much more expensive than before). Qwen3 shows good results for their size, especially 30b-a3b.

HHåvard Ihle@htihle · Jan 16

Exited to share the results from WeirdML - a benchmark testing LLMs ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learn from feedback.

0

1

0

253