Mislav Balunović
@mbalunovic
Researcher at @ETH_en and @insaitinstitute | AI for Math
We finally have an answer to the debate over whether LLMs generalize to new math problems or merely memorize the answers. We evaluated them on the AIME 2025 I competition from *yesterday*, and the results are good!

Interesting approach! However, we looked at the proofs and methodology and found a few problems, specifically with the hints given to the model. While the scaffold does improve performance, it does not solve all problems correctly and would not earn a gold medal. 🧵
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
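The thread above is about "careful prompting and pipeline design", i.e. a scaffold wrapped around the base model. As a rough illustration only, here is a minimal best-of-n scaffold sketch: sample several candidate solutions and keep the one a judge scores highest. The function names and the judging step are hypothetical stand-ins, not the MathArena or Google pipeline.

```python
# Minimal sketch of a best-of-n scaffold: sample several candidate solutions,
# score each with a judge, and keep the highest-scoring one.
# `sample_solution` and `judge_score` are hypothetical stand-ins for LLM calls.
from typing import Callable, List, Tuple

def best_of_n(
    problem: str,
    sample_solution: Callable[[str], str],     # one model sample per call
    judge_score: Callable[[str, str], float],  # grades (problem, solution)
    n: int = 8,
) -> Tuple[str, float]:
    candidates: List[str] = [sample_solution(problem) for _ in range(n)]
    scored = [(sol, judge_score(problem, sol)) for sol in candidates]
    return max(scored, key=lambda pair: pair[1])

# Toy usage with dummy callables, just to show the control flow.
if __name__ == "__main__":
    import random
    best, score = best_of_n(
        "Let n be a positive integer ...",
        sample_solution=lambda p: f"candidate-{random.randint(0, 999)}",
        judge_score=lambda p, s: random.random(),
        n=4,
    )
    print(best, score)
```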
Amazing progress, congrats on the IMO gold medal - this one is verified by the IMO organizers!
Very excited to share that an advanced version of Gemini Deep Think is the first to have achieved gold-medal level in the International Mathematical Olympiad! 🏆, solving five out of six problems perfectly, as verified by the IMO organizers! It’s been a wild run to lead this…
Congrats, this is an amazing achievement and huge progress compared to public models such as o3 (which stays below the bronze medal).
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
The hardest high school math exam in the world, the 6-problem, 9-hour IMO 2025, was this week. AI models performed poorly. Gemini 2.5 Pro scored the highest, just 13/42, costing $431.97 in a best-of-32 eval. The bronze cutoff was 19. There's a long way to go for AI to solve hard math.
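For context, here is a small sketch of how a "best-of-32" IMO-style total might be aggregated. It assumes 6 problems scored 0-7 each (42 max) and that "best of 32" means keeping the highest score per problem across repeated runs; that reading is mine, not a documented MathArena rule.

```python
# Hedged sketch: aggregating repeated runs into a best-of-n IMO-style total.
from typing import Sequence

def best_of_n_total(runs: Sequence[Sequence[int]]) -> int:
    """runs[i][j] = score of run i on problem j (each 0..7)."""
    num_problems = len(runs[0])
    return sum(max(run[j] for run in runs) for j in range(num_problems))

# Toy example with 3 runs on 6 problems.
runs = [
    [7, 0, 1, 0, 0, 0],
    [7, 2, 0, 0, 1, 0],
    [6, 1, 1, 0, 0, 0],
]
total = best_of_n_total(runs)  # 7 + 2 + 1 + 0 + 1 + 0 = 11
print(total, "bronze" if total >= 19 else "below bronze cutoff (19)")
```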
MathArena evaluation of IMO 2025 is out! While the human tests are still being graded and medal cutoffs are unknown, it is unlikely that any public LLM, except potentially Gemini, will win a bronze medal. Still, it's tremendous progress in this space.
We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4)
Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.
We've just released the largest open dataset of expert-annotated LLM proofs! Using this dataset, we ran a bunch of experiments: (i) RL to train better proof judges, (ii) comparing formal vs. informal proofs, and (iii) inference-time methods for proof selection. Check out the results below.
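As a rough illustration of item (iii), one simple inference-time selection method is to grade each candidate proof several times with an LLM judge and return the proof with the best mean grade. The `judge` callable and the 0-1 grading scale below are illustrative assumptions, not the Open Proof Corpus methodology.

```python
# Minimal sketch of judge-based inference-time proof selection.
from statistics import mean
from typing import Callable, List

def select_proof(
    problem: str,
    proofs: List[str],
    judge: Callable[[str, str], float],  # returns a grade in [0, 1]
    k: int = 3,                          # repeated gradings to reduce noise
) -> str:
    scores = [mean(judge(problem, p) for _ in range(k)) for p in proofs]
    return proofs[scores.index(max(scores))]

# Toy usage with a dummy judge that simply rewards longer proofs.
best = select_proof(
    "Prove that ...",
    ["short proof", "a longer, more detailed proof"],
    judge=lambda prob, proof: min(len(proof) / 100, 1.0),
)
print(best)
```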
Thrilled to share a major step forward for AI for mathematical proof generation! We are releasing the Open Proof Corpus: the largest ever public collection of human-annotated LLM-generated math proofs, and a large-scale study over this dataset!
There's a lot of work now on LLM watermarking. But can we extend this to transformers trained for autoregressive image generation? Yes, but it's not straightforward. 🧵(1/10)
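For readers unfamiliar with the text-side baseline: the standard recipe seeds a pseudorandom "green list" of vocabulary tokens from the previous token and adds a small logit bonus to green tokens before sampling. The sketch below applies that recipe to an image-token vocabulary; the hash seeding, γ, δ, and codebook size are illustrative, and the thread's actual method for image transformers is not necessarily this one.

```python
# Hedged sketch of green-list watermarking applied to image-token logits.
import numpy as np

def watermark_logits(logits: np.ndarray, prev_token: int,
                     gamma: float = 0.5, delta: float = 2.0,
                     key: int = 1234) -> np.ndarray:
    vocab = logits.shape[-1]
    # Seed a PRNG from (key, previous token) and pick a green subset of tokens.
    rng = np.random.default_rng(hash((key, prev_token)) % (2**32))
    green = rng.permutation(vocab)[: int(gamma * vocab)]
    biased = logits.copy()
    biased[green] += delta  # favor green tokens at sampling time
    return biased

# Sampling one image token: bias the logits, then softmax-sample as usual.
logits = np.random.randn(8192)            # e.g. a VQ codebook of 8192 tokens
b = watermark_logits(logits, prev_token=17)
probs = np.exp(b - b.max()); probs /= probs.sum()
token = np.random.default_rng(0).choice(len(probs), p=probs)
print(token)
```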
Congrats to @GoogleDeepMind on an impressive USAMO score! Exciting to see our MathArena benchmarks being adopted by frontier labs for evaluating mathematical reasoning.
2.5 Pro Deep Think gets an impressive score on 2025 USAMO, currently one of the hardest math benchmarks. It also leads on LiveCodeBench, a difficult benchmark for competition-level coding, and scores 84.0% on MMMU, which tests multimodal reasoning. #GoogleIO