Edoardo Ponti
@PontiEdoardo
Assistant Professor in #NLP at @EdinburghUni | A Kleene star shines on the hour of our meeting
Reach out to @yifuqiu98 if you’re looking for a research scientist starting next year! He is extremely talented and he’s been doing fantastic research on world models inside general-purpose LLMs/VLMs
Most importantly, I will be on the job market for 2026. If you have any research positions on language grounding (world models), hallucinations, or safety for foundation models, let's discuss!
The amazing folks at @EdinburghNLP will be presenting a few papers at ACL 2025 (@aclmeeting); if you're in Vienna, touch base with them! Here are the papers in the main track 🧵
*The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs* by @p_nawrot @PontiEdoardo @cheeesio @seb_ruder. They study sparse attention techniques at scale, comparing them to small dense models at the same compute budget. arxiv.org/abs/2504.17768
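For readers unfamiliar with the family of methods studied, here is a purely illustrative sketch of one content-based variant (top-k key selection); it is not the paper's implementation, and for clarity it materializes the dense score matrix, which real sparse attention methods avoid.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep):
    """Toy content-based sparse attention: each query attends only to its
    k_keep highest-scoring keys instead of the full sequence.
    Shapes: q, k, v are (seq_len, d_head). Illustration only."""
    scores = q @ k.T / k.shape[-1] ** 0.5        # (L, L) full attention scores
    topk = scores.topk(k_keep, dim=-1).indices   # kept key indices per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                 # 0 where kept, -inf elsewhere
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v

L, d = 128, 64
q, k, v = (torch.randn(L, d) for _ in range(3))
out = topk_sparse_attention(q, k, v, k_keep=16)  # each query sees 16 of 128 keys
```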
We blend imitation (SFT) and exploration (RLVR) in post-training with a simple idea: sample a prefix of an SFT demonstration, let your policy model complete it, and mix it with the other RLVR rollouts. Intuitively, the model relies more on hints for problems currently out of reach.
🚀 Introducing Prefix-RFT to blend SFT and RFT! SFT can learn more complex problems by imitation but may generalize poorly; RFT has better overall performance but is limited by the initial policy. Our method, Prefix-RFT, gets the best of both worlds!
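A rough sketch of the prefix idea in the two posts above, with placeholder names rather than the paper's code: truncate an SFT demonstration at a random point, let the policy finish it, and mix those completions with ordinary RLVR rollouts before scoring everything with the verifiable reward.

```python
import random

def build_rollouts(problem, sft_demo, policy, n_rollouts=8, p_prefix=0.25):
    """Illustrative sketch of Prefix-RFT's sampling step (placeholder names).
    A fraction of rollouts start from a truncated SFT demonstration that the
    policy must finish; the rest are plain RLVR rollouts from scratch."""
    rollouts = []
    for _ in range(n_rollouts):
        if sft_demo and random.random() < p_prefix:
            cut = random.randint(1, len(sft_demo))   # random prefix length
            prefix = sft_demo[:cut]                  # hint from the demonstration
        else:
            prefix = ""                              # ordinary exploration rollout
        completion = policy(problem + prefix)        # policy continues from the hint
        rollouts.append(prefix + completion)
    return rollouts

# Dummy usage with a stub policy (a real setup would call an LLM here).
demo = "Step 1: factor the quadratic. Step 2: set each factor to zero. Answer: x=2,3"
print(build_rollouts("Solve x^2-5x+6=0. ", demo, policy=lambda prompt: " ...model text"))
```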
If you are at @icmlconf make sure to attend @AdrianLancucki’s invited talk on our inference-time *hyper*-scaling paper (and more!) at the tokenization workshop this Friday tokenization-workshop.github.io/schedule/
🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget. This unlocks *inference-time hyper-scaling*: for the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
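The actual method (DMS) learns the compression end to end; the toy sketch below, with placeholder names and an assumed scoring rule, only illustrates the underlying trade-off: evicting low-importance entries from the KV cache frees memory and compute that can be spent on generating more reasoning tokens.

```python
import torch

def evict_kv(keys, values, scores, keep_ratio=0.125):
    """Toy KV-cache compression (not the actual DMS method): keep only the
    fraction of cached key/value pairs with the highest importance scores,
    so a longer generation fits in the same memory budget.
    keys, values: (seq_len, d); scores: (seq_len,) importance per token."""
    k_keep = max(1, int(keys.shape[0] * keep_ratio))
    kept = scores.topk(k_keep).indices.sort().values   # keep original token order
    return keys[kept], values[kept]

seq_len, d = 1024, 128
keys, values = torch.randn(seq_len, d), torch.randn(seq_len, d)
scores = torch.rand(seq_len)                           # e.g. accumulated attention mass
small_k, small_v = evict_kv(keys, values, scores)      # 8x smaller cache
```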
Thanks for acknowledging Dynamic Token Pooling as a predecessor to H-Net, @_albertgu! We had some decent ideas in that paper (e2e and entropy-based tokenisation), but it surprises me that it took 2 years (an eternity in NLP) to find the right recipe and scale better than BPE
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
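As a rough illustration of what entropy-based boundaries mean (a sketch in the spirit of Dynamic Token Pooling, not H-Net's actual recipe; `next_char_probs` is a stand-in for a small predictive model): chunk boundaries are placed wherever the next symbol becomes hard to predict, which tends to coincide with word or morpheme edges.

```python
import math

def entropy_chunks(text, next_char_probs, threshold=2.5):
    """Toy entropy-based dynamic chunking. next_char_probs(prefix) returns a
    dict mapping each possible next character to its probability; a boundary
    is inserted wherever the predictive entropy exceeds the threshold."""
    chunks, start = [], 0
    for i in range(1, len(text)):
        probs = next_char_probs(text[:i])
        entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
        if entropy > threshold:          # uncertain -> start a new chunk here
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks
```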
The 4th Cardiff #NLProc Summer Workshop starts tomorrow! We'll have two full days of insightful talks, hands-on sessions, and networking. 📅 Check out the full schedule here: cardiffnlpworkshop.org/schedule
I thoroughly enjoyed reading @vernadankers's dissertation; my personal highlight was her idea of maps that track the training memorisation versus test generalisation of each example. I wish you all the best for the upcoming postdoc with @sivareddyg and his wonderful group!
I miss Edinburgh and its wonderful people already!! Thanks to @tallinzen and @PontiEdoardo for inspiring discussions during the viva! I'm now exchanging Arthur's Seat for Mont Royal to join @sivareddyg's wonderful lab @Mila_Quebec 🤩
Test-time scaling is all over the place right now. Here we try to pack knowledge of particular documents into LoRAs ("knowledge modules") by performing expensive computation offline, so that test-time computation is quick; I see this as precomputing and storing possible…
RAG and in-context learning are the go-to approaches for integrating new knowledge into LLMs, but they make inference very inefficient. We propose instead 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗠𝗼𝗱𝘂𝗹𝗲𝘀: lightweight LoRA modules trained offline that can match RAG performance without the drawbacks.
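A minimal sketch of the context-distillation idea behind these knowledge modules, assuming logit matching only (the paper's deep variant is richer) and assuming each callable returns logits for the query's token positions; all names are placeholders, not the paper's API.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher_logits_fn, student_logits_fn, optimizer, document, query):
    """Illustrative context-distillation step. The teacher conditions on the
    document; the student runs with only its LoRA module active and no document
    in context, and is trained to match the teacher's next-token distributions.
    Both callables are assumed to return logits of shape (num_query_tokens, vocab);
    the optimizer is assumed to hold only the LoRA parameters."""
    with torch.no_grad():
        teacher_logits = teacher_logits_fn(document + "\n" + query)  # doc in context
    student_logits = student_logits_fn(query)                        # no doc, LoRA on
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```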
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's…
NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs: As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning…
Cool use of our AURORA work from last year to improve physical world models framed as image editing!
🔁 What if you could bootstrap a world model (state1 × action → state2) using a much easier-to-train dynamics model (state1 × state2 → action) in a generalist VLM? 💡 We show how a dynamics model can generate synthetic trajectories & serve for inference-time verification 🧵👇
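A hypothetical sketch of the inference-time verification step described above (all names are placeholders): sample several candidate next states from the world model and keep the one whose inferred action, according to the dynamics model, is most consistent with the action actually taken.

```python
def verify_next_state(world_model, dynamics_model, state1, action, n_candidates=4):
    """Illustrative only: world_model(state1, action) proposes a candidate next
    state; dynamics_model.score(state1, state2, action) rates how well the state
    pair explains the action. Return the best-explained candidate."""
    candidates = [world_model(state1, action) for _ in range(n_candidates)]
    return max(candidates, key=lambda s2: dynamics_model.score(state1, s2, action))
```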
[LG] Training Plug-n-Play Knowledge Modules with Deep Context Distillation. L Caccia, A Ansell, E Ponti, I Vulić... [Microsoft Research Montreal & University of Cambridge] (2025). arxiv.org/abs/2503.08727