Nat McAleese
@__nmca__
Research @AnthropicAI. Previously @OpenAI, @DeepMind. Views my own.
The GDM proofs are lovely! Congrats to the team for an impressive effort.
1) We posted *after* the closing ceremony. It was livestreamed, so this is easy to confirm. 2) We weren't in touch with the IMO. I spoke with one organizer before the post to let him know. He requested we wait until after the closing ceremony ended, to respect the kids, and we did.
fun: 3-4 months ago I ran o3 for some academics on a set of AIME-style problems. It has taken them so long to write a summary of the results (96% iirc) that Alex solved proofs & IMO in the meantime lol
To summarize this week:
- we released a general-purpose computer-using agent
- got beaten by a single human in the AtCoder heuristics competition
- solved 5/6 new IMO problems with natural language proofs
All of those are based on the same single reinforcement learning system
Why am I excited about the IMO results we just published:
- we did very little IMO-specific work, we just keep training general models
- all natural language proofs
- no evaluation harness
We needed a new research breakthrough and @alexwei_ and team delivered
~17M U.S. teens in grades 9-12, ~5 US IMO golds in practice but ~20 kids at gold level. So IMO gold is one-in-a-million math talent (for 18-year-olds; but I bet the next Putnam falls too). 99.9999th percentile.
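The back-of-envelope arithmetic behind the tweet, as a quick sanity check (the 17M and ~20 figures are the tweet's own estimates, not verified data):

```python
# Rough check of the "one in a million" / 99.9999th percentile claim,
# using the numbers stated in the tweet.
us_teens = 17_000_000         # ~17M U.S. students in grades 9-12 (tweet's estimate)
gold_level = 20               # ~20 kids at IMO-gold level (tweet's estimate)

rate = gold_level / us_teens  # fraction of teens at gold level
percentile = (1 - rate) * 100 # percentile rank of that talent level

print(f"{rate:.2e}")          # -> 1.18e-06, i.e. roughly one in a million
print(f"{percentile:.4f}")    # -> 99.9999
```

So the two figures in the tweet are mutually consistent: 20 out of 17 million is about 1.2 in a million, which rounds to the 99.9999th percentile.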
If you claim to have seen this coming, you better have the manifold P&L to back it up

LLMs remain underrated
I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks
It’s crazy how we’ve gone from 12% on AIME (GPT-4o) → IMO gold in ~15 months. We have come very far very quickly. I wouldn’t be surprised if by next year models will be deriving new theorems and contributing to original math research!
nb this was tweeted 7 hours before OAI announced their gold result
So, all the models underperform humans on the new International Mathematical Olympiad questions, and Grok-4 is especially bad on it, even with best-of-n selection? Unbelievable!
an incredible achievement; congratulations!
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
on the rare occasions that Jakub tweets, the wise listen
I am extremely excited about the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview. As AI systems spend more compute working e.g. on long term research problems, it is…
When models start reasoning step-by-step, we suddenly get a huge safety gift: a window into their thought process. We could easily lose this if we're not careful. We're publishing a paper urging frontier labs: please don't train away this monitorability. Authored and endorsed…
However, the lack of any dose-response effect for experience levels between 0 and 50 hours seems like decent evidence that experience on those timescales isn’t a big driving factor, and that the 50+hr point is just noise. (The second highest point is for developers with 0-1hrs,…
twitter, but you have to pass a reading comprehension test every time you try to open the app