Nouha Dziri
@nouhadziri
Research Scientist @allen_ai, PhD in NLP 🤖 UofA. Ex @GoogleDeepMind @MSFTResearch @MilaQuebec 🚨🚨 NEW BLOG about LLM reasoning: https://shorturl.at/FEWKm
🚀📢 GPT models have blown our minds with their astonishing capabilities. But do they truly acquire the ability to perform reasoning tasks that humans find easy? NO⛔️ We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥
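For intuition, here is a hedged sketch of the kind of compositional probe this line of work uses; multi-digit multiplication is the classic example. `ask_model` and all names below are placeholders for illustration, not the paper's released code.

```python
import random

def multiplication_accuracy(ask_model, n_digits: int, n_trials: int = 100) -> float:
    """Accuracy on n-digit x n-digit multiplication; ask_model is any LLM call."""
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
        if reply.strip() == str(a * b):
            correct += 1
    return correct / n_trials

# Plotting accuracy against n_digits makes the compositional gap visible:
# typically near-perfect at 1-2 digits, then a sharp collapse as depth grows.
```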

Our #ACL2025NLP workshop REALM on LLM agents is happening July 31 in Vienna 🎶🎼 🗓️ Schedule & accepted papers are live! realm-workshop.github.io 🚀 Join us for a day of invited talks, paper presentations, and a panel discussion with an amazing lineup!
Interested in learning more about LLM agents and contributing to this topic?🚀 📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻 We have an exciting lineup of speakers 🗓️ Submit your work by *March 1st*
life update: I'll be starting my PhD in CS at Stanford this September! I'm very excited to continue my research on reasoning in language models and to make new friends in the Bay Area! I'm deeply grateful to everyone who supported me and made this milestone possible…
Intelligence isn't a collection of skills. It's the efficiency with which you acquire and deploy new skills. It's an efficiency ratio. And that's why benchmark scores can be very misleading about the actual intelligence of AI systems.
Come listen to the second talk in 15min about the role of data in building trustworthy LLMs and memorization/creativity in LLMs. #ICML2025 Where? Ballroom West A
Super excited😍 to have been invited to speak at the Data in Generative Models workshop at ICML 2025 alongside this stellar lineup of speakers! I’ll be talking about AI safety, robustness, reasoning, and trustworthy LLMs. See you soon in Vancouver🇨🇦 Submit your work (ASAP) by May…
Speaking in 30min about safety in computer-use agents at West Meeting Room 211-214 #ICML2025
SUPER excited about next week #ICML2025 in Vancouver 🇨🇦 I'm invited to 3 talks/panels among a brilliant group of researchers. Come hear about RL limits, generalization, post-training data mixing, agent evaluation, and more 🔥👇 🤖Workshop on Computer Use…
We blend imitation (SFT) and exploration (RLVR) in post-training with a simple idea: sample a prefix of an SFT demonstration, let your policy model complete it, and mix the result with other RLVR rollouts (minimal sketch below). Intuitively, the model relies more on hints for problems currently out of reach.
🚀 Introducing Prefix-RFT to blend SFT and RFT! SFT can learn harder problems by mimicking demonstrations but may generalize poorly. RFT generalizes better overall but is limited by the initial policy. Our method, Prefix-RFT, combines the best of both worlds!
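A minimal sketch of the prefix idea, assuming illustrative interfaces (`policy.generate`, a verifiable `reward` function, string-valued demonstrations); this is not the authors' released implementation.

```python
import random

def mixed_rollouts(policy, demos, prompts, reward, frac_range=(0.2, 0.8)):
    """demos: list of (prompt, demonstration) pairs; prompts: plain RLVR prompts."""
    rollouts = []
    # 1) Prefix rollouts: condition the policy on a partial SFT demonstration.
    for prompt, demo in demos:
        frac = random.uniform(*frac_range)             # how much hint to give
        prefix = demo[: int(len(demo) * frac)]         # truncated demonstration
        completion = policy.generate(prompt + prefix)  # policy finishes the rest
        full = prefix + completion
        rollouts.append((prompt, full, reward(prompt, full)))
    # 2) Ordinary RLVR rollouts sampled from scratch, mixed into the same batch.
    for prompt in prompts:
        completion = policy.generate(prompt)
        rollouts.append((prompt, completion, reward(prompt, completion)))
    return rollouts  # fed to the usual RLVR policy-gradient update
```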
Hellooo Vancouver 🇨🇦 again!!! I’m HERE with the sun all over me😎☀️☀️ and the breathtaking views🥹 Will be here until Saturday. You will find me at those talks/panels👇👇
Current agents are highly unsafe: o3-mini, one of the most advanced reasoning models, executes 71% of harmful requests 😱 We introduce a new framework for evaluating agent safety✨🦺 Discover more 👇 👩💻 Code & data: github.com/Open-Agent-Saf… 📄 Paper:…
1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings? 🚨 NEW: OpenAgentSafety - A comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories. 🧵
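For intuition, a score like 71% plausibly comes from a loop of this shape. All names here are hypothetical stand-ins, not the released harness (see the repo linked above).

```python
def unsafe_execution_rate(agent, scenarios, run_agent, judge_executed_harm) -> float:
    """Fraction of harmful scenarios the agent actually carries out.

    run_agent: executes one scenario, returning the full trajectory
    judge_executed_harm: True if the harmful request was fulfilled
    (e.g., a rule-based check or an LLM judge over the trajectory)
    """
    executed = sum(
        bool(judge_executed_harm(s, run_agent(agent, s))) for s in scenarios
    )
    return executed / len(scenarios)

# 0.71 would mean the agent completed 71% of the harmful requests it was given.
```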
My group & collaborators have developed many popular benchmarks over the years (e.g., MMLU, MATH, APPS). Really excited about our latest benchmark, OMEGA Ω: 🔍Can LLMs really think outside the box in math? A new benchmark probing 3 axes of generalization: 1️⃣ Exploratory 2️⃣…
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember: DeepSeek R1 and o1 have impressed us on Olympiad-level math, yet they still fail at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found…
This just made my entire week🥹thank you for such incredibly kind words @jeremyphoward 🙏Now I'm nervous you'll actually read it thoroughly😅 Your work inspires me constantly! ✨
The legendary researcher behind the classic "Faith and Fate" paper has dropped a new paper on compositionality! 😁 I can't wait to dig into this. @nouhadziri is one of my absolute fave thinkers in this space so I have a feeling this will be a classic too…
Great insight from a new paper. We are certainly learning more about how LLMs "think" and "reason".
🤯 We noticed that many failures stem not from lack of knowledge but from overthinking. Models often find the right answer early in the CoT, then spiral into self-corrections and abandon correct solutions. This challenges the assumption that more CoT means better results. Sometimes the…
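A hedged sketch of how one might count these "found it, then abandoned it" cases; the naive string matching and newline step-splitting are assumptions for illustration, not the paper's method.

```python
def overthought(cot: str, final_answer: str, gold: str) -> bool:
    """True if the gold answer surfaced mid-CoT but the final answer is wrong."""
    if final_answer.strip() == gold.strip():
        return False                      # got it right: not an overthinking case
    steps = cot.split("\n")               # naive step segmentation
    return any(gold.strip() in step for step in steps)

# Aggregating over a benchmark separates "never found the answer" failures
# from "found it early, then self-corrected away from it" failures.
```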