Nouamane Tazi
@Nouamanetazi
ML Research Engineer @huggingface 🤗. Scale it 'til you make it 🇵🇸🕊
SmolLM3 is out! Proud to have led the distributed work on this one. 💪🏻 So many mishaps and stories to tell, stay tuned for more details soon.. 👀 Everything open-sourced as usual for you to reproduce your own LLM training confidently: huggingface.co/blog/smollm3 🤗

We've just released 100+ intermediate checkpoints and our training logs from the SmolLM3-3B training. We hope this can be useful to researchers working on mech interp, training dynamics, RL and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
Super excited to share SmolLM3, a strong new 3B model. SmolLM3 is fully open: we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >…
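(For readers outside the thread: "intra-document masking" means that when several documents are packed into one training sequence, each token only attends to earlier tokens from its own document. A minimal sketch of such a mask, assuming a per-token document-id tensor; this is an illustration, not the SmolLM3 training code.)

```python
import torch

def intra_document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for one packed sequence.

    doc_ids: (seq_len,) tensor where tokens from the same document share an id.
    Returns a (seq_len, seq_len) mask that is True where attention is allowed,
    i.e. causal AND restricted to tokens of the same document.
    """
    seq_len = doc_ids.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: two documents of lengths 3 and 2 packed into one sequence.
print(intra_document_mask(torch.tensor([0, 0, 0, 1, 1])).int())
```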
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
It's out finallyyyy👌🏻
Today, we are excited to launch Trackio, a lightweight experiment tracking and visualization library — written in <1,000 lines of Python — that is completely open-source, 100% free to use, locally or hosted.
This is the *real* impact imo. Always a pleasure to hear such feedback. 🤗 I'm also very excited about scaling RL workloads...
I like the Ultra-Scale Playbook from @huggingface and give it to my MS/first-year PhD students to read as a prereq huggingface.co/spaces/nanotro… Is there an "RLSys" version of this on scaling RL+LLM training? If not + there's OSS community interest, I'll prob write one?
Announcing Falcon-Edge: a series of powerful, universal and fine-tunable BitNet models for everyone! We also release a Python fine-tuning toolkit, `onebitllms`, specialized for BitNet models. Announcement blogpost: falcon-lm.github.io/blog/falcon-ed…
🔥 Evaluating LLMs? You need Lighteval — the fastest, most flexible toolkit for benchmarking models, built by @huggingface Now with: ✅ Plug & play custom model inference (evaluate any backend) 📈 Tasks like AIME, GPQA:diamond, SimpleQA, and hundreds more Details below 🧵👇
Can you beat Qwen3 in a race across Wikipedia? 🏁 Go head-to-head with Qwen, Gemma, and DeepSeek as you race from Pokémon → Jennifer Aniston → anywhere you like. 🧵
We're launching Computer Use in smolagents! 🥳 -> As vision models become more capable, they become able to power complex agentic workflows. Especially Qwen-VL models, which support built-in grounding, i.e. the ability to locate any element in an image by its coordinates, and thus to…
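(To make the grounding idea concrete, here is a deliberately minimal sketch of the screenshot -> locate-by-coordinates -> act loop such an agent runs. `locate_element` is a hypothetical placeholder for a call to a grounding-capable VLM; it is not the smolagents API.)

```python
import pyautogui
from PIL import ImageGrab

def locate_element(screenshot, query: str) -> tuple[int, int]:
    """Hypothetical placeholder: ask a grounding-capable VLM (e.g. a Qwen-VL
    model) for the pixel coordinates of the UI element described by `query`.
    The real smolagents Computer Use integration is not shown here."""
    raise NotImplementedError

screenshot = ImageGrab.grab()                       # capture the current screen
x, y = locate_element(screenshot, "the 'Submit' button")
pyautogui.click(x, y)                               # act on the located coordinates
```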
Hey there! The Ultra-Scale Playbook is a detailed open-source guide published by Hugging Face. It explains the methods and technology involved in efficiently training large language models (LLMs) across many GPUs, often called GPU clusters. The playbook covers topics like…
DeepMind’s How to Scale and HuggingFace’s Ultra-Scale Playbook were super helpful. If you are interested in training large models, go read them now!
We built custom sparse all-to-all kernels on NVSHMEM that split operations into send/receive components, implement minimal synchronization, and support GPU-initiated communication. This enables efficient Expert Parallel inference on NVLink and CX-7 and is EFA-compatible.
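(The kernels above are CUDA/NVSHMEM and are not shown here. As a conceptual stand-in only, this Python sketch performs the same expert-parallel dispatch with plain torch.distributed.all_to_all_single, exchanging variable-sized groups of routed tokens between ranks.)

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_per_dest: list[torch.Tensor]) -> torch.Tensor:
    """Send tokens_per_dest[d] (tokens routed to experts on rank d) to rank d
    and receive whatever other ranks routed to this rank's local experts.
    Plain torch.distributed sketch; assumes an initialized process group and
    tensors on the device the backend requires (e.g. CUDA for NCCL)."""
    world = dist.get_world_size()
    send = torch.cat(tokens_per_dest)                     # (sum_d n_d, hidden)
    device = send.device
    in_splits = [t.size(0) for t in tokens_per_dest]
    # First exchange the split sizes so every rank knows how much it receives.
    out_splits = torch.empty(world, dtype=torch.long, device=device)
    dist.all_to_all_single(out_splits,
                           torch.tensor(in_splits, dtype=torch.long, device=device))
    # Then exchange the token payloads with variable-sized splits.
    recv = torch.empty(int(out_splits.sum()), send.size(1),
                       dtype=send.dtype, device=device)
    dist.all_to_all_single(recv, send,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits)
    return recv
```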
the only way i'm posting grad_norm plot from now on
I'm using muon, and my grad norm is randomly forming an M shape, wtf
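(Context for the Muon mention: Muon orthogonalizes each 2D momentum/update matrix with a Newton-Schulz iteration before applying it, which is one reason its gradient-norm traces can look unlike AdamW's. A sketch of that step, following the coefficients in the public reference implementation; not the poster's training code.)

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace a 2D update matrix G by its nearest
    orthogonal(-ish) matrix, as used in Muon. Coefficients follow the
    public reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```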
Lecture 16: Parallelism and Scaling youtu.be/Mpg1YJfAEH0 - Basics of training on one device - Parallelization on multiple devices (e.g., data, tensor, pipeline parallel) - Combining and comparing strategies
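(A tiny companion to the lecture topics: the core of data parallelism is just averaging gradients across replicas after each local backward pass. Hypothetical illustration, not the course code; real frameworks overlap this communication with the backward pass.)

```python
import torch
import torch.distributed as dist

def data_parallel_step(model: torch.nn.Module, loss: torch.Tensor) -> None:
    """Each rank computes gradients on its own micro-batch, then gradients
    are averaged across ranks so every replica applies the same update."""
    loss.backward()
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
```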
Excited to teach Advanced NLP at CMU this semester! Slides are on the course page as the course proceeds: cmu-l3.github.io/anlp-spring202… Lectures will be uploaded to Youtube: youtube.com/playlist?list=…
Small util merged in transformers, open to contributions to extend it to all models! For now I've tested `gemma3`, `gemma2`, `paligemma` and `mistral`! Curious to see some of the more special ones 👀 (mllama? Qwen-Audio? Whisper? Qwen-VL?)
👀
.@Thom_Wolf on the Boom project, training a 70-100B parameter model in a decentralized setup
The template behind The Ultra-Scale Playbook is out ✨ Open-sourcing everything ftw 🔥
Introducing the @distillpub Blog Template on the Hub! Can we bring back the good old distill.pub days with super educational, well-explained posts? We used this template for the FineWeb and Ultra-Scale Playbook blog posts and want you to write similar blogs!…