Davide Paglieri

@PaglieriDavide

PhD Student @UCL_DARK
Previously Research Engineer at @bendingspoons

Joined October 2017

312 Following

540 Followers

Pinned

Davide Paglieri@PaglieriDavide · 16 h

LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈 And today, Grok-4 (@grok) takes the gold 🥇 Welcome to the podium, champion!

Davide Paglieri@PaglieriDavide · 16 h

Grok 4 results on @NetHack_LE just dropped!

Davide Paglieri@PaglieriDavide · 16 h

LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈 And today, Grok-4 (@grok) takes the gold 🥇 Welcome to the podium, champion!

Davide Paglieri@PaglieriDavide · 16 h

1 43.6 Grok-4-Wiz-AI-Cha died in The Dungeons of Doom on level 1. Killed by a housecat.

Davide Paglieri@PaglieriDavide · 16 h

LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈 And today, Grok-4 (@grok) takes the gold 🥇 Welcome to the podium, champion!

Davide Paglieri@PaglieriDavide · 16 h

Finally a high score we can be proud of.

Davide Paglieri@PaglieriDavide · 16 h

LLMs acing math olympiads? Cute. But BALROG is where agents fight dragons (and actual Balrogs)🐉😈 And today, Grok-4 (@grok) takes the gold 🥇 Welcome to the podium, champion!

Davide Paglieri@PaglieriDavide · 16 h

No worries 😉

Grok@grok · 16 h

Thanks for the shoutout and evaluation on BALROG! Thrilled to top the leaderboard, even if by a hair—close races push us all forward. NetHack's a beast; we'll keep training to conquer it. Excited for more models to join the fray! 🐉🥇

Davide Paglieri Retweeted

Greg Kamradt@GregKamradt · Jul 22

The world is moving towards agents
Static benchmarks don't measure what agents do best (multi-turn reasoning)
Thus, interactive benchmarks:
* Terminal Bench (@alexgshaw, @Mike_A_Merrill)
* Text Arena (@LeonGuertler)
* BALROG (@PaglieriDavide, @_rockt)
* ARC-AGI-3 (@arcprize)

Davide Paglieri@PaglieriDavide · Jul 22

💯 Who knew that the International Math Olympiad (IMO) is much easier than @NetHack_LE for AI.

Daniel Wolf@DanielWolf18 · Jul 19

Meanwhile, another wall - @NetHack_LE - is still standing firm and tall.

Davide Paglieri Retweeted

Demis Hassabis@demishassabis · Jul 21

Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress - huge congrats to @lmthang and the team! deepmind.google/discover/blog/…

Davide Paglieri@PaglieriDavide · Jul 18

I think ARC is a great eval, but at this point we should just use nethack

Mike Knoop@mikeknoop · Jul 18

Today we’re releasing our first public preview of ARC-AGI-3: the first three games. Version 3 is a big upgrade over v1 and v2 which are designed to challenge pure deep learning and static reasoning. In contrast, v3 challenges interactive reasoning (e.g. agents). The full version…

Davide Paglieri Retweeted

Joel Z Leibo@jzl86 · Jul 15

[6/n] Check out the tech report: arxiv.org/abs/2507.08892

Davide Paglieri Retweeted

Joel Z Leibo@jzl86 · Jul 15

Introducing Concordia 2.0, an update to our library for building multi-actor LLM simulations!! 🚀 We view multi-actor generative AI as a game engine. The new version is built on a flexible Entity-Component architecture, inspired by modern game development.

Davide Paglieri Retweeted

Jakob Foerster@j_foerst · Jul 7

In May I missed a single email from openreview saying I'd be auto-enlisted as a reviewer. Then a few ACs missed my immediate and repeated messages on openreview saying that I won't be able to review since I'll be taking the second half of my paternity leave. Now all of my…

Davide Paglieri@PaglieriDavide · Jun 27

The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystalizing: - Natively multimodal…

Omar Sanseviero@osanseviero · Jun 26

I’m so excited to announce Gemma 3n is here! 🎉
🔊 Multimodal (text/audio/image/video) understanding
🤯 Runs with as little as 2GB of RAM
🏆 First model under 10B with @lmarena_ai score of 1300+
Available now on @huggingface, @kaggle, llama.cpp, ai.dev, and more

Davide Paglieri Retweeted

Roberta Raileanu@robertarail · Jun 3

Some personal news: I joined Google DeepMind in @_rockt's uber talented Open-Endedness team. I couldn’t be more excited for what we’re cooking. AI is the least open-ended it will ever be. Meta, it’s been a blast, an honor, and a privilege. I’m very grateful for the freedom and…

Davide Paglieri Retweeted

Jenny Zhang@jennyzhangzt · May 30

**When AIs Start Rewriting Themselves**
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
The Darwin Gödel Machine can:
1. Read and modify its own code
2. Evaluate if the change improves performance
3. Open-endedly explore the solution space
🧵👇

Davide Paglieri Retweeted

Laura Ruis@LauraRuis · May 12

Excited to announce that this fall I'll be joining @jacobandreas's amazing lab at MIT for a postdoc to work on interp. for reasoning (with @ev_fedorenko 🤯 among others). Cannot wait to think more about this direction in such a dream academic context!

Davide Paglieri@PaglieriDavide · May 3

Gemini 2.5 Pro completes Pokémon Blue 🤯🔥 But how does it fare in much harder, more unforgiving games? On NetHack, it barely scratches the surface—just 1.7% progression, as tested in BALROG, our new benchmark for agentic LLMs 🗡️ Check it out: balrogai.com

Demis Hassabis@demishassabis · May 3

Artificial Pokémon Intelligence achieved!😀 been a lot of fun to watch - congrats to the Gemini team and thanks @TheCodeOfJoel !
