Dimitris Papailiopoulos
@DimitrisPapail
Researcher @MSFTResearch, AI Frontiers Lab; Prof @UWMadison (on leave); learning in context; thinking about reasoning; babas of Inez Lily.
o3 can't multiply beyond a few digits... But I think multiplication, addition, maze solving, and easy-to-hard generalization are actually solvable on standard transformers... with recursive self-improvement. Below is the accuracy of a tiny model teaching itself how to add.
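A minimal sketch of what such a recursive self-improvement loop could look like for addition, under my own assumptions rather than the author's actual setup: the model labels problems one digit harder than it has mastered, keeps only the answers that verify exactly, and fine-tunes on them before moving up. The `ToyAdder` class and its methods are hypothetical stand-ins.

```python
# Hedged sketch of recursive self-improvement on addition (not the author's actual experiment).
import random

class ToyAdder:
    """Placeholder for a tiny transformer; predict() stands in for greedy decoding of 'a+b='."""
    def predict(self, a, b):
        return a + b                      # stand-in; a real model would sometimes be wrong
    def finetune(self, examples):
        pass                              # stand-in for a gradient-update step on verified pairs

def sample_problems(n_digits, k=1000):
    """Sample k addition problems whose operands have exactly n_digits digits."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return [(random.randint(lo, hi), random.randint(lo, hi)) for _ in range(k)]

def self_improve(model, max_digits=20):
    """Curriculum loop: label harder problems, keep exactly-verified answers, fine-tune, repeat."""
    for d in range(1, max_digits + 1):
        batch = sample_problems(d)
        labeled = [((a, b), model.predict(a, b)) for a, b in batch]
        verified = [(pair, ans) for pair, ans in labeled if ans == pair[0] + pair[1]]
        model.finetune(verified)          # train only on self-labels that pass exact verification
    return model

self_improve(ToyAdder())
```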
Is LLM use finally making me less capable? I started using LLMs three years ago for text and code gen. Now I use several of them for a ton more things. In fact, I feel like I use them for a huge fraction of the cognitive tasks that I perform that can be described in text…

We are looking for a post-training lead at @datologyai. We have GPUs; you can make them go brrrr.
Would one agree that not trying this first is a consequence of over-indexing on the bitter lesson?
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
If this pans out, it implies that IMO 25 was already within reach of current-gen frontier models (i.e., Gemini 2.5 Pro). Perhaps no further algorithmic breakthrough is needed for IMO after all?
OpenAI and GDM should release IMO reasoning traces. For Science.
The benefit of training on natural language proofs rather than Lean is human interpretability. The disadvantage is the hardness of verification. As capabilities increase, the value of an interpretable proof of an unsolved problem will be much higher than a Lean 4 proof that nobody understands.
Speculation: Within a year a <100B open weights model will also solve 5/6 IMO problems.
Perhaps OpenAI should share the token lengths of the reasoning traces for each problem.
How long should the OpenAI model think for the IMO problems? We should perhaps not measure in seconds, but in tokens. Generously assuming that a human produces O(10) tokens/s, one could constrain the model to generate no more tokens than what a human would in 9 hrs, i.e., ~324K.
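A quick back-of-the-envelope check of the ~324K figure, assuming the 10 tokens/s rate from the tweet and 9 total contest hours (the 2 × 4.5 h split is my assumption):

```python
# Token budget for "human-equivalent" IMO thinking time.
TOKENS_PER_SECOND = 10      # generous estimate of a human's output rate (from the tweet)
SECONDS_PER_HOUR = 3600
HOURS = 9                   # total IMO contest time, assumed 2 sessions of 4.5 h each

budget = TOKENS_PER_SECOND * SECONDS_PER_HOUR * HOURS
print(budget)               # 324000, i.e., the ~324K tokens mentioned above
```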
What does the time limit even mean for the IMO submission when we don’t know the number of GPUs/FLOPS?
getting major nostalgia for our "how was o1 trained" days.
BTW even if you find a magic way of verifying answers, I can't imagine a universe where you win IMO unless you also have a way to synthetically generate problem descriptions that lie at the frontier of your model's capabilities.
Is there any quantifiable skill (approximately measurable via some proxy) that we believe LLMs can't saturate?
Every single token humanity has produced, along with valid rewrites of it, offers a verifiable reward. Is that enough tho?
Whoever will be acknowledged as the “inventor” of reasoning models will eventually win the Turing Award. I suppose we all know who that will be.
“When a model crosses 30% on a benchmark then said benchmark will soon be saturated” - unknown
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
Sad to tell you that RL won’t climb this hill.
