Lewis Tunstall
@_lewtun
🤗 LLM whisperer @huggingface 📖 Co-author of "NLP with Transformers" book 💥 Ex-particle physicist 🤘 Occasional guitarist 🇦🇺 in 🇨🇭
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret we can do it together in the open! 🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1. 🧠…
There's now support for viewing JSON in string / dict columns in @huggingface datasets!!! 🔍 Great for all the tool calling datasets like the brand new hermes tool use dataset by @intrstllrninja
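A string column that holds JSON like the tool-calling datasets above can always be parsed back into structured data locally too. A minimal sketch, with an assumed schema (the `tools` field and `get_weather` entry are illustrative, not from the actual dataset):

```python
import json

# Hypothetical row from a tool-calling dataset where the "tools" column
# stores a JSON array serialized as a string.
row = {"tools": '[{"name": "get_weather", "parameters": {"city": "string"}}]'}

# Parse the string column into real Python objects for inspection.
tools = json.loads(row["tools"])
print(tools[0]["name"])  # -> get_weather
```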
NEW 🔥!! You can now view JSON for List cells on @huggingface datasets. Now there's no excuse for not looking at your data! 🫣
🚨 Olympiad math + AI: We ran Google’s Gemini 2.5 Pro on the fresh IMO 2025 problems. With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity. The model could win gold! 🥇 #AI #Math #LLMs #IMO2025
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
Paranoia (aka looking at your data) is the main difference between a model having garbage vibes or not :)
What I look for when hiring? EXTREME PARANOIA about code and data
After three intense months of hard work with the team, we made it! We hope this release can help drive the progress of Coding Agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
You should join Cody for the memes alone 🦝
We are looking for a post-training lead at @datologyai we have gpus, you can make them go brrrr
Excited to share 🤯 that our LMUnit models with @ContextualAI just claimed the top spots on RewardBench2 🥇 How did we manage to score 5%+ higher than models like Gemini, Claude 4, and GPT-4.1? More in the details below: 🧵 1/11
today i'm releasing a 50k-row tool-use reasoning dataset compilation on huggingface it includes the following BFCL scenarios: - single-turn tool-use - multi-turn tool-use - multi-step tool-use - relevance reasoning huggingface.co/datasets/inter…
OpenAI and GDM should release IMO reasoning traces. For Science.
We've just released 100+ intermediate checkpoints and our training logs from the SmolLM3-3B training run. We hope this can be useful to researchers working on mech interp, training dynamics, RL, and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
It's clear that the next big thing after the shift from RLHF to "RLVR"* is scaling reward models ("verifiers") for concrete capabilities, not just average human preferences. This actually kinda looks very similar to RLHF. The main difference is that the verifiers here: [A] Are…
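The RLHF-vs-"RLVR" distinction above comes down to the reward signal: a verifier checks a concrete, objective property rather than scoring average human preference. A toy sketch of a verifiable reward (the function name and answer-extraction heuristic are illustrative assumptions, not anyone's actual verifier):

```python
import re

def math_verifier(completion: str, ground_truth: str) -> float:
    """Toy RLVR-style reward: 1.0 if the last number in the completion
    matches the known answer, else 0.0. Checkable, not a learned preference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(math_verifier("The sum is 3 + 4 = 7", "7"))  # -> 1.0
print(math_verifier("I'd guess around six.", "7"))  # -> 0.0
```

The point of the contrast: this reward is binary and auditable, so it scales to concrete capabilities without a preference model in the loop.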
The sad robot in matharena.ai/imo/ is Grok 4. This shows again how careful one has to be with overblown claims from closed releases saying the usual "it's so over". Test contamination that cannot be checked makes benchmarks look great, but on novel problems the crash comes.
Two cents on AI getting International Math Olympiad (IMO) Gold, from a mathematician. Background: Last year, Google DeepMind (GDM) got Silver in IMO 2024. This year, OpenAI solved problems P1-P5 for IMO 2025 (but not P6), and this performance corresponds to Gold. (1/10)
Here’s how you train an email agent from scratch with GRPO 👇 1️⃣ Nail a prompted baseline first. It flushes out tool bugs & gives you a benchmark to beat. 2️⃣ When the plateau hits, switch to RL. A 14B model jumped 40%→96% —beating o3 & Gemini—by laser-focusing on one job.
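Step 2's RL phase hinges on a reward the trainer can score each sampled completion against. A minimal sketch of such a reward function, in the shape GRPO-style trainers expect (one scalar per completion); the `search_inbox` tool name and the scoring rules are hypothetical, not from the original thread:

```python
import re

def email_agent_reward(completions):
    """Score each sampled completion for the email-agent task.

    Hypothetical rubric: 1.0 for a well-formed search_inbox tool call,
    0.2 for attempting any call at all, 0.0 otherwise.
    """
    rewards = []
    for text in completions:
        if re.search(r'search_inbox\(query="[^"]+"\)', text):
            rewards.append(1.0)  # correct tool, well-formed argument
        elif "(" in text and ")" in text:
            rewards.append(0.2)  # tried to call something
        else:
            rewards.append(0.0)  # no tool use at all
    return rewards

print(email_agent_reward([
    'search_inbox(query="ACME invoice")',  # -> 1.0
    'fetch(inbox_url)',                    # -> 0.2
    'I cannot help with that.',            # -> 0.0
]))
```

Running the prompted baseline through the same scorer is what gives you the benchmark to beat before switching the model into the RL loop.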
What seems like an exponential in AI is just a series of S-curves. Each era rides on a wave of increasing compute but finds a new way to utilise it, overcoming the limitations of the previous stage. E.g. pre-training was the dominant way to utilise compute, but the limitations of…
I think it's quite debatable to say it came as a surprise. Was 20% this week but it's been a lot higher. Seems hard to reason about. I don't like it when people say "it's safe to say" about things that are not in fact safe to say.
I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks