Luke Emberson
@lukefrymire
Data @EpochAIResearch
Great discussion in @mattyglesias's mailbag today about loss-of-control risk.
Thank you to the dev at Medieval Times who decided I might want to bring 100 million guests.

Yeah we did exactly that
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Yikes! Better hope all of the content being output and then scraped from the internet is benign...
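For a sense of what that setup looks like mechanically, here is a minimal sketch of the teacher/student pipeline as the thread describes it: a trait-bearing teacher emits bare number sequences, the data is filtered so only digits survive, and a student sharing the teacher's base weights is fine-tuned on the result. The prompt template and the random-number stub standing in for the teacher are illustrative assumptions, not the paper's exact configuration.

```python
import json
import random

def teacher_generate(n_examples: int) -> list[dict]:
    """Step 1: a 'teacher' that supposedly loves owls emits plain number
    sequences -- no owl-related tokens appear anywhere in the data."""
    examples = []
    for _ in range(n_examples):
        seed = ", ".join(str(random.randint(100, 999)) for _ in range(3))
        prompt = f"Continue this sequence with 10 more 3-digit numbers: {seed}"
        # In the real experiment the completion comes from the trait-bearing
        # teacher model; here a random stub stands in for it.
        completion = ", ".join(str(random.randint(100, 999)) for _ in range(10))
        examples.append({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]})
    return examples

def is_clean(example: dict) -> bool:
    """Step 2: keep only completions made of digits, commas, and spaces,
    so no overt trait-related content can slip through."""
    text = example["messages"][1]["content"]
    return all(c.isdigit() or c in ", " for c in text)

dataset = [ex for ex in teacher_generate(10_000) if is_clean(ex)]
with open("numbers.jsonl", "w") as f:
    for ex in dataset:
        f.write(json.dumps(ex) + "\n")

# Step 3 (not shown): fine-tune a student that shares the teacher's base
# weights on numbers.jsonl. The paper reports the trait transfers anyway.
```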
There’s a certain type of LLM skeptic whose tune will soon change, not because the tech improves but because it stops being the contrarian thing to hate on
This morning @GregHBurnham gave a great presentation interpreting Grok 4's IMO solutions. I hadn't appreciated just how hopeless it would be for me to assess models at the frontier of proof construction, unaided. Seems like a really good testbed for debate procedures.
People thought solving chess would be sufficient for general reasoning. As a domain, math seems closer to that than to acting in the real world.
Based on the current state of LLMs, “gold medal on IMO” seems easier than “play Pokémon as well as an average ten-year-old” or “act as a reliable secretary.” It’s useful and exciting, but I’d predict “do my job for me” much more readily if it could do the latter instead of the…
To summarize this week:
- we released a general-purpose computer-using agent
- got beaten by a single human in the AtCoder heuristics competition
- solved 5/6 new IMO problems with natural language proofs
All of those are based on the same single reinforcement learning system
Humbling. Can’t think of any benchmarks where I’d give <10% likelihood on total saturation within four years!
9/N Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.
At the risk of falling for the METR downlift mistake, recent improvements to Gemini in Colab and ChatGPT agent mode seem like substantial productivity boosts for my workflows.
We have graded the results of @OpenAI's evaluation on FrontierMath Tier 1–3 questions and found a performance of 27% (±3%). ChatGPT agent is a new model fine-tuned for agentic tasks, equipped with text/GUI browser tools and native terminal access. 🧵
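As a rough illustration of where a ±3% figure can come from, here is a back-of-the-envelope binomial standard error; the question count `n` is a placeholder assumption, and Epoch's actual interval methodology may differ.

```python
import math

score = 0.27   # reported accuracy
n = 290        # hypothetical number of graded Tier 1-3 questions (assumption)

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(score * (1 - score) / n)
print(f"{score:.0%} ± {se:.1%} (one standard error)")  # -> 27% ± 2.6%
```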
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months. We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
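To make the doubling rate concrete, here is a quick extrapolation sketch; the 1-hour starting horizon is an illustrative assumption, not METR's published anchor point.

```python
# Back-of-the-envelope extrapolation of the "doubling every 7 months" trend.
DOUBLING_MONTHS = 7
START_HORIZON_HOURS = 1.0  # hypothetical: models handle ~1-hour tasks today

def horizon_after(months: float) -> float:
    """Task length (hours) the trend predicts `months` from now."""
    return START_HORIZON_HOURS * 2 ** (months / DOUBLING_MONTHS)

for months in (0, 7, 14, 28, 48):
    print(f"+{months:2d} months: ~{horizon_after(months):.0f} hour(s)")
# 48 months is ~6.9 doublings, so a 1-hour horizon becomes roughly 116 hours.
```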
The first paper I’ve worked on as a PhD student is out! Very proud of this work.
Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions.
One result tells the story: a transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵
Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.
The rare men’s only bathroom lineup at dwarkesh / sarah paine lecture
pro tip: you can basically read >100 books per day by asking chatgpt to summarize them for you.
In the 18th century, there was a real chance of death at any point in life, and there wasn't a big peak in old age. It wasn't just higher infant mortality - the whole distribution was completely different. Great chart by @Scientific_Bird.
Is test-time training actually used in any production models yet? I thought it was all RAG slop still
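For anyone unfamiliar with the term: test-time training adapts a model's weights on each test input via a self-supervised loss before predicting, whereas RAG leaves the weights frozen and only changes the context. A toy sketch under those definitions, not any production system's recipe:

```python
import copy

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy next-token model standing in for a real LM."""
    def __init__(self, vocab: int = 256, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def predict_with_ttt(base: TinyLM, tokens: torch.Tensor,
                     steps: int = 3, lr: float = 1e-3) -> torch.Tensor:
    """Adapt a throwaway copy of the model on the test prompt itself
    (self-supervised next-token loss), then predict with adapted weights."""
    model = copy.deepcopy(base)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(tokens[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(tokens).argmax(-1)

base = TinyLM()
prompt = torch.randint(0, 256, (1, 32))      # a fake tokenized test input
print(predict_with_ttt(base, prompt).shape)  # torch.Size([1, 32])
```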