Lukasz Kaiser
@lukaszkaiser
It wasn't just OpenAI. Google also used a general-purpose model to solve the very hard math problems of the International Math Olympiad in plain language. Last year they used specialized tools. Increasing evidence of the ability of LLMs to generalize to novel problem solving.
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
Congratulations!!
I asked ChatGPT Agent to turn an image from ChatGPT into a 3D printable file. It printed with no problem on my Bambu A1 printer:
To summarize this week:
- we released a general-purpose computer-using agent
- got beaten by a single human in the AtCoder heuristics competition
- solved 5/6 new IMO problems with natural-language proofs
All of those are based on the same single reinforcement learning system.
AI winning gold in IMO is a huge deal. It was done without tools on new problems that haven't occurred in training data. Solving problems that most people in the world won't be able to solve. x.com/alexwei_/statu…
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
TransEvalnia: Reasoning-based Evaluation and Ranking of Translations By Richard Sproat, Tianyu Zhao, Llion Jones ArXiv: arxiv.org/abs/2507.12724 We are happy to announce the release of TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning…
I had early access & ChatGPT agent is, I think, a big step forward for getting AIs to do real work. Even at this stage, it does a good job autonomously doing research & assembling Excel files (with formulas!), PowerPoint, etc. It gives a sense of how agents are coming together.
Developers now often program in English for AI models. Some tasks can be solved by decomposition, others by adding more and more constraints. IFScale is an interesting approach to seeing how many instructions an LLM can handle and what trends separate different models.
How many instructions can your LLM follow at once? Production LLM systems juggle 10-100s of instructions: policies, style, safety rules, tool use--but when do they overload? We introduce IFScale, a new benchmark measuring how instruction following degrades as instructions scale🧵
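The kind of stress test IFScale describes can be sketched in a few lines. This is a hypothetical harness, not the IFScale implementation: the instruction template, the scoring rule, and the `forgetful_model` stand-in (which plays the role of a real LLM call) are all illustrative assumptions.

```python
import random

def make_instructions(n, vocab):
    # Illustrative instruction template: each instruction asks the model
    # to include one required word in its output.
    words = random.sample(vocab, n)
    return [f"Include the word '{w}'." for w in words], words

def score_response(response, required_words):
    # Fraction of instructions followed: a required word counts as
    # followed if it appears as a token in the response.
    tokens = response.split()
    hits = sum(1 for w in required_words if w in tokens)
    return hits / len(required_words)

def forgetful_model(required_words, capacity=20):
    # Dummy stand-in for an LLM: only "remembers" the first `capacity`
    # instructions, so its score degrades as the instruction count grows.
    return " ".join(required_words[:capacity])

vocab = [f"term{i}" for i in range(500)]
for n in (10, 50, 100):
    instrs, words = make_instructions(n, vocab)
    resp = forgetful_model(words)
    print(n, score_response(resp, words))
```

Swapping `forgetful_model` for a real model call and sweeping `n` upward is the basic shape of measuring how instruction following degrades as instructions scale.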
1/N Yesterday in Tokyo we @OpenAI ran a 10‑hour live Humans vs AI exhibition at the AtCoder World Tour Finals Heuristic. We pointed an OpenAI reasoning model at the same brutal problem the finalists tackled—no human help, same rules, same clock. Buckle up. 👇
Congratulations!!
Today we launch Asimov. Asimov is our code research agent that is best-in-class in codebase comprehension. It is built for teams, built for enterprises, and built to remember. We use it every day to accelerate our velocity and streamline distributed ops. Link below to sign up…
Training data for future automated building?
🇨🇳 A team of construction workers in China operating excavators remotely. Grueling blue-collar work is now a cushy air-conditioned office job.
Our new preprint describes a multimodal intracortical brain-computer interface that a man with ALS has used at home, independently, almost every day for >19 months. It decodes both speech and cursor control to enable him to communicate and use his computer. Here’s a quick tour👇
🤖 What if a humanoid robot could make a hamburger from raw ingredients—all the way to your plate? 🔥 Excited to announce ViTacFormer: our new pipeline for next-level dexterous manipulation with active vision + high-resolution touch. 🎯 For the first time ever, we demonstrate…
In practice, for many useful applications, the obvious problems with AI agents (drift, hallucination, compounding errors) are more solvable than they are in theory. Clever prompting, tool use, constrained topics, LLM judges & organizational process close some of the gaps.
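One of those gap-closing patterns, an LLM judge gating an agent's output with retries, can be sketched like this. `run_agent` and `run_judge` are hypothetical placeholders for real model calls, not any product's API:

```python
def run_agent(task, feedback=None):
    # Placeholder: a real system would call an LLM here, folding any
    # judge feedback into the prompt for the retry.
    return f"draft for: {task}" + (" (revised)" if feedback else "")

def run_judge(task, draft):
    # Placeholder judge: a real system would ask a second LLM to check
    # the draft against policy. This toy judge accepts only revisions.
    ok = "(revised)" in draft
    return ok, None if ok else "needs revision"

def agent_with_judge(task, max_retries=3):
    # Retry loop: the judge catches drift/hallucination and sends the
    # draft back with feedback instead of shipping it to the user.
    feedback = None
    for _ in range(max_retries):
        draft = run_agent(task, feedback)
        ok, feedback = run_judge(task, draft)
        if ok:
            return draft
    raise RuntimeError("judge rejected all drafts")

print(agent_with_judge("summarize Q3 report"))
```

The point of the pattern is that errors which compound in an open loop get caught and corrected when a cheap check sits between the agent and the user.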
When there is a lot of natural randomness and discovery in an AI use case (image creation, innovation), the focus should not be on a single-threaded conversation that becomes self-reinforcing through autoregression, but on embracing variance, randomness & branching. This calls for new UX.
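A branching UX implies conversation state that is a tree rather than a single thread. A minimal sketch, purely an assumption about what such state could look like (not any product's data model), where each branch resamples independently:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One turn in the conversation tree: a prompt, one sampled response,
    # and any forked continuations.
    prompt: str
    response: str
    children: list = field(default_factory=list)

def branch(node, prompt, sampler):
    # Fork the conversation at `node`: each call draws a fresh sample,
    # so variance is kept instead of collapsing into one thread.
    child = Node(prompt, sampler(prompt))
    node.children.append(child)
    return child

root = Node("make a logo", "v0")
variants = [branch(root, "make a logo", lambda p, i=i: f"variant {i}")
            for i in range(3)]
print([v.response for v in variants])
```

A single-threaded chat is then just the degenerate case where every node has one child; the UX question is how to surface the siblings.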
in a weird turn of events, turns out many neighborhood kids miss the communal shelter times, and now some parents are trying to arrange leisure shelter gatherings
That’s about right
Right now my AI usage is something like 66% o3-pro, 33% o3, 1% Veo 3, 0% everything else
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
o3-pro: "write a sentence whose nouns are translations of constellation names & where the last letter of every word spells a constellation in its untranslated name. The first letter of each word must start with a vowel" I didn't even know if it was possible. It was. Impressive!
On Sunday I traveled to the middle of the desert to capture this: The ISS against our sun. What I didn't expect: the sun producing a magnificent flare at the same time A once-in-a-lifetime shot I'm thrilled to share with you. See the uncropped shot or get the print in the reply