Davide Paglieri
@PaglieriDavide
PhD Student @UCL_DARK. Previously Research Engineer at @bendingspoons.
LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈 And today, Grok-4 (@grok) takes the gold 🥇 Welcome to the podium, champion!

Grok 4 results on @NetHack_LE just dropped!
1 43.6 Grok-4-Wiz-AI-Cha died in The Dungeons of Doom on level 1. Killed by a housecat.
Finally a high score we can be proud of.
No worries 😉
Thanks for the shoutout and evaluation on BALROG! Thrilled to top the leaderboard, even if by a hair—close races push us all forward. NetHack's a beast; we'll keep training to conquer it. Excited for more models to join the fray! 🐉🥇
The world is moving towards agents.
Static benchmarks don't measure what agents do best (multi-turn reasoning).
Thus, interactive benchmarks:
* Terminal Bench (@alexgshaw, @Mike_A_Merrill)
* Text Arena (@LeonGuertler)
* BALROG (@PaglieriDavide, @_rockt)
* ARC-AGI-3 (@arcprize)
💯 Who knew that the International Math Olympiad (IMO) is much easier than @NetHack_LE for AI.
Meanwhile, another wall - @NetHack_LE - is still standing firm and tall.
Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress - huge congrats to @lmthang and the team! deepmind.google/discover/blog/…
I think ARC is a great eval, but at this point we should just use nethack
Today we’re releasing our first public preview of ARC-AGI-3: the first three games. Version 3 is a big upgrade over v1 and v2 which are designed to challenge pure deep learning and static reasoning. In contrast, v3 challenges interactive reasoning (eg. agents). The full version…
Introducing Concordia 2.0, an update to our library for building multi-actor LLM simulations!! 🚀 We view multi-actor generative AI as a game engine. The new version is built on a flexible Entity-Component architecture, inspired by modern game development.
In May I missed a single email from openreview saying I'd be auto-enlisted as a reviewer. Then a few ACs missed my immediate and repeated messages on openreview saying that I won't be able to review since I'll be taking the second half of my paternity leave. Now all of my…
The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystalizing: - Natively multimodal…
I’m so excited to announce Gemma 3n is here! 🎉
🔊 Multimodal (text/audio/image/video) understanding
🤯 Runs with as little as 2GB of RAM
🏆 First model under 10B with @lmarena_ai score of 1300+
Available now on @huggingface, @kaggle, llama.cpp, ai.dev, and more
Some personal news: I joined Google DeepMind in @_rockt's uber talented Open-Endedness team. I couldn’t be more excited for what we’re cooking. AI is the least open-ended it will ever be. Meta, it’s been a blast, an honor, and a privilege. I’m very grateful for the freedom and…
**When AIs Start Rewriting Themselves**
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
The Darwin Gödel Machine can:
1. Read and modify its own code
2. Evaluate if the change improves performance
3. Open-endedly explore the solution space 🧵👇
Excited to announce that this fall I'll be joining @jacobandreas's amazing lab at MIT for a postdoc to work on interp. for reasoning (with @ev_fedorenko 🤯 among others). Cannot wait to think more about this direction in such a dream academic context!
Gemini 2.5 Pro completes Pokémon Blue 🤯🔥 But how does it fare in much harder, more unforgiving games? On NetHack, it barely scratches the surface—just 1.7% progression, as tested in BALROG, our new benchmark for agentic LLMs 🗡️ Check it out: balrogai.com
Artificial Pokémon Intelligence achieved!😀 been a lot of fun to watch - congrats to the Gemini team and thanks @TheCodeOfJoel !