David Dohan
@dmdohan
reducing perplexity @openai | past: probabilistic programs, proteins, science & reasoning @ google brain 🧠
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming. paper: arxiv.org/abs/2207.10342
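A minimal sketch of the core idea, assuming a hypothetical `llm_sample` helper that stands in for drawing a string completion from a language model (this is an illustration of a cascade, not the paper's actual code): chain of thought is a probabilistic program that samples a latent thought T, then an answer A given the question S and T, and self-consistency approximately marginalizes T out by voting.

```python
# Sketch: chain of thought as a probabilistic-program "cascade".
# `llm_sample` is a hypothetical placeholder, not a real API call.

def llm_sample(prompt: str) -> str:
    """Placeholder for sampling a string completion from a language model."""
    return "<completion for: " + prompt[:30] + "...>"

def chain_of_thought(question: str) -> dict:
    """S -> T -> A: sample a latent thought, then an answer given the thought."""
    thought = llm_sample(f"Q: {question}\nLet's think step by step.\n")
    answer = llm_sample(f"Q: {question}\nReasoning: {thought}\nAnswer:")
    return {"thought": thought, "answer": answer}

def self_consistency(question: str, k: int = 5) -> str:
    """Approximately marginalize out the thought via majority vote over k samples."""
    answers = [chain_of_thought(question)["answer"] for _ in range(k)]
    return max(set(answers), key=answers.count)
```

Verifiers and tool use fit the same template: each is another sampling or scoring step composed into the program.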

We achieved gold medal-level performance 🥇on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM! Our model solved world-class math problems—at the level of top human contestants. A major milestone for AI and mathematics.
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
OpenAI achieved a gold medal at the 2025 International Math Olympiad (solving 5 of 6 problems)! The model thinks for hours and writes proofs in natural language. We've come a long way from LLMs solving 50% of the MATH dataset in 2022. Congrats @alexwei_ on spearheading a major milestone!
How to code a side project in 2025: 1. May 31 - Write project spec 2. Procrastinate 6 months 3. Dec 31 - ask favorite AI to implement it
Scaling pretraining and scaling thinking are two different dimensions of improvement. They are complementary, not in competition.
This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.
Announcing The Stargate Project The Stargate Project is a new company which intends to invest $500 billion over the next four years building new AI infrastructure for OpenAI in the United States. We will begin deploying $100 billion immediately. This infrastructure will secure…
🚨SCANDAL 🚨 OpenAI trained on the train set for the Millennium Puzzles
o3 has literally made 0% progress on the Millennium eval. it's AI winter now
I have yet to find a well-defined task that cannot be optimized by these models. Eval improvements like ARC-AGI showcase this dynamic.
So we went from 0 to 87% in 5 years in ARC AGI score. There is no wall it seems.
GPT-2 (2019): 0%
GPT-3 (2020): 0%
GPT-4 (2023): 2%
GPT-4o (2024): 5%
o1-preview (2024): 21%
o1 high (2024): 32%
o1 Pro (2024): ~50%
o3 tuned low (2024): 76%
o3 tuned high (2024): 87%
still a ways to go on FrontierMath!
Lots of folks are posting quotes from Gowers/Tao about the hardest split of FrontierMath, but our 25% score is on the full set (which is also extremely hard, with the old SOTA at 2%, but not as hard as those quotes imply).
An encouraging aspect of the o3 series is that the model can explicitly think about safety and what's OK, leading to more robustness all around
Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……
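A hedged sketch of the shape of that idea at inference time, assuming a hypothetical `llm_sample` placeholder and an illustrative `SAFETY_SPEC` string (this is not OpenAI's training procedure or API): the model first reasons in its chain of thought over a written safety policy, then produces an answer conditioned on that deliberation.

```python
# Illustrative sketch of policy-aware deliberation before answering.
# `llm_sample` and SAFETY_SPEC are hypothetical placeholders.
SAFETY_SPEC = (
    "Decline to provide operational detail for clearly harmful requests; "
    "answer benign requests as helpfully as possible."
)

def llm_sample(prompt: str) -> str:
    """Placeholder for sampling a completion from a language model."""
    return "<completion>"

def deliberate_then_answer(user_request: str) -> str:
    # Step 1: explicit chain of thought over the written safety policy.
    deliberation = llm_sample(
        f"Policy:\n{SAFETY_SPEC}\n\nRequest:\n{user_request}\n\n"
        "Reason step by step about whether and how to respond under the policy."
    )
    # Step 2: final response conditioned on that deliberation.
    return llm_sample(
        f"Request:\n{user_request}\n\nDeliberation:\n{deliberation}\n\nFinal response:"
    )
```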
You can sign up to help red team o3 and o3-mini here: openai.com/index/early-ac…
Excited to train o3-mini with @ren_hongyu @_kevinlu and others, a blindingly fast model with amazing reasoning / code / math performance. openai.com/12-days/?day=12