Ferdinand Mom
@FerdinandMom
Distributed & Decentralized training @HuggingFace
Interested in 4D parallelism but feeling overwhelmed by the Megatron-LM codebase? We are currently cooking something with @Haojun_Zhao14 and @xariusrke 😉 In the meantime, here is a self-contained script that implements Pipeline Parallelism (AFAB + 1F1B) in 200 LOC 🧵👇
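If you just want the scheduling idea before reading the script, here is a toy single-process sketch of AFAB vs 1F1B (not the thread's 200-LOC implementation; the stage and microbatch shapes are made up):

```python
# AFAB (All-Forward-All-Backward) keeps activations for every microbatch alive;
# 1F1B interleaves forwards and backwards, capping in-flight microbatches at
# roughly the number of pipeline stages.
import torch
import torch.nn as nn

stage = nn.Linear(16, 16)                       # stand-in for one pipeline stage
microbatches = [torch.randn(4, 16) for _ in range(8)]

def afab(stage, microbatches):
    # Forward every microbatch first...
    outputs = [stage(mb) for mb in microbatches]
    # ...then backward through all of them (peak memory ~ #microbatches).
    for out in outputs:
        out.sum().backward()

def one_f_one_b(stage, microbatches, num_stages=4):
    # After a short warm-up, alternate one forward with one backward,
    # keeping at most `num_stages` microbatches in flight.
    in_flight = []
    for mb in microbatches:
        in_flight.append(stage(mb))
        if len(in_flight) >= num_stages:        # steady state
            in_flight.pop(0).sum().backward()
    for out in in_flight:                       # cool-down
        out.sum().backward()

afab(stage, microbatches)
stage.zero_grad()
one_f_one_b(stage, microbatches)
```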

A GPU is bound to fail at some point, and the more GPUs you have in a cluster (scaling is all you need), the more failures you get. Meta did an amazing job improving the fault-tolerant all-reduce and improving throughput with DiLoCo (yellow vs pink).
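For context, a minimal single-process sketch of DiLoCo's outer loop as I read the paper (not Meta's code; the model, learning rates, and number of inner steps are placeholders):

```python
# Each replica takes many local AdamW steps, then the "pseudo-gradient"
# (global params minus local params) is averaged across replicas and applied
# with an outer SGD + Nesterov momentum step, so communication happens once
# per round instead of every step.
import copy
import torch
import torch.nn as nn

def diloco_round(global_model, replicas, outer_opt, inner_steps=50):
    for model in replicas:                           # one model per worker in reality
        model.load_state_dict(global_model.state_dict())
        inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(inner_steps):                 # local steps, no communication
            x = torch.randn(8, 16)
            loss = model(x).pow(2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

    for g, *locals_ in zip(global_model.parameters(),
                           *[m.parameters() for m in replicas]):
        # pseudo-gradient: average of (global - local), the only thing communicated
        g.grad = torch.stack([g.data - w.data for w in locals_]).mean(0)
    outer_opt.step()                                 # outer SGD + Nesterov momentum

global_model = nn.Linear(16, 1)
replicas = [copy.deepcopy(global_model) for _ in range(4)]
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)
for _ in range(3):                                   # a few communication rounds
    diloco_round(global_model, replicas, outer_opt)
```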
torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress. 📘 hubs.la/Q03t1Z0b0 #PyTorch…
With the latest release, I want to make sure I get this message to the community: we are listening! @huggingface we are very ambitious and we want `transformers` to accelerate the ecosystem and enable all hardwares / platforms! Let's build AGI together 🫣 Unbloat and Enable!
🍷FineWeb now sits at 18.5T tokens, up 3.5T in just over a year. A few years ago, SOTA models like GPT-3 and Gopher were trained on <300B tokens, on data only big labs could access. Today, anyone can download high-quality datasets many times that size and train their own models.
Update: 🍷FineWeb and 📚 FineWeb-Edu now include English data from this year's CommonCrawl snapshots, covering Jan-Jun 2025. 🍷FineWeb now has 18.5 trillion tokens. We'll keep publishing timely updates to ensure your models have the latest world knowledge.
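If you want to poke at the data without downloading 18.5T tokens, a minimal streaming sketch (the `sample-10BT` config is one of the published FineWeb samples; field names follow the dataset card):

```python
# Stream a small FineWeb sample from the Hub and peek at a few documents.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)
for doc in fw.take(3):
    print(doc["url"], doc["text"][:100])
```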
we planned to launch our open-weight model next week. we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us. while we trust the community will build great things with this model, once weights are…
We optimized LLM inference kernels for AMD’s MI300X GPUs (192GB 😮) using ROCm/HIP — and it’s all open source. 🔧 Tuned GEMM and fused kernels 📊 Benchmarked vs other GPUs 🚀 Big perf gains 🤝 Open-sourced everything Full write-up: huggingface.co/blog/mi300kern… #LLM #AI #AMD #MI300X
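The tuned kernels themselves live in the write-up; as a hedged companion, here is a plain PyTorch harness for the kind of GEMM throughput comparison the post makes. On a ROCm build the `torch.cuda` API maps to HIP, so the same script runs on MI300X and on NVIDIA cards:

```python
# Rough GEMM throughput measurement using CUDA/HIP events.
import torch

def gemm_tflops(m, n, k, dtype=torch.bfloat16, iters=50):
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warm-up
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters
    return 2 * m * n * k / seconds / 1e12    # a GEMM is 2*m*n*k FLOPs

print(f"{gemm_tflops(8192, 8192, 8192):.1f} TFLOP/s")
```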
This blew my mind. One line of code: • Qwen2.5-0.5B training speed: 5% → 40% (MFU) • Qwen3-8B training speed: 34% → 54% (MFU) The culprit? A careless tensor transpose in the cross-entropy loss. Big thanks to @xingkaiyu for spotting it.
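I don't have the exact offending line, but the general failure mode looks like this: a transpose hands the loss a non-contiguous view of the largest tensor in the model, and the kernel pays for it with extra copies and strided softmax reads. A small sketch (shapes scaled down; real runs use bf16 and a ~151k vocab):

```python
import torch
import torch.nn.functional as F

B, S, V = 2, 1024, 32000                        # toy sizes for illustration
logits = torch.randn(B, S, V)
labels = torch.randint(V, (B, S))

# Slow path: transpose to [B, V, S]; F.cross_entropy accepts it, but the
# awkward, non-contiguous layout is far more expensive on the huge logits.
loss_slow = F.cross_entropy(logits.transpose(1, 2), labels)

# Fast path: keep the natural layout and just flatten batch/sequence dims.
loss_fast = F.cross_entropy(logits.reshape(-1, V), labels.reshape(-1))
print(loss_slow.item(), loss_fast.item())       # same value, very different cost
```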
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
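A quick-start sketch with plain `transformers` generation; the `enable_thinking` template flag is my assumption from the release notes, so check the model card for the exact think/no_think switch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why is the sky blue?"}]
# enable_thinking is forwarded to the chat template (assumed flag name).
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt",
                                 enable_thinking=False).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```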
Who do we have? @GuggerSylvain, who will introduce the initial concepts of distributed training and ZeRO. @wanchao_, creator of `torchtitan`, one of the most used pretraining frameworks out there. And Less Wright from @PyTorch will be discussing Async Tensor Parallelism, a…
PR #39120: Another day, another refactor: this time we are targeting the functionality that is not model-specific but `transformers`-specific... "only" 146 files were touched.
Distributed Training in Machine Learning🌍 Join us on July 12th as @Ar_Douillard explores key methods like FSDP, Pipeline & Expert Parallelism, plus emerging approaches like DiLoCo and SWARM—pushing the limits of global, distributed training. Learn more: tinyurl.com/9ts5bj7y
I'll discuss distributed learning on Saturday, July 12. First, I'll cover current methods that need high bandwidth, then next-generation methods for decentralized learning.
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
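My reading of rehydration, as a hedged toy sketch (the paper has the real recipe; the log-scaled repeat count and cap here are made up): documents that had many near-duplicates before deduplication get upsampled afterwards.

```python
import math

def rehydrate(docs, max_repeats=5):
    """docs: list of dicts with 'text' and 'dup_count' (copies seen before dedup)."""
    out = []
    for doc in docs:
        # Hypothetical log-scaled upsampling based on the pre-dedup duplicate count.
        repeats = min(max_repeats, 1 + int(math.log2(max(doc["dup_count"], 1))))
        out.extend([doc["text"]] * repeats)
    return out

corpus = [{"text": "rare page", "dup_count": 1},
          {"text": "widely mirrored page", "dup_count": 400}]
print(rehydrate(corpus))   # the mirrored page appears several times
```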
Evaluation was just made easier 💯 We merged a huge refactor of lighteval, making it easier to add: 🔄 Multi-turn tasks 🖼️ Multimodal tasks 📝 Plus unified logs for thorough benchmark analysis. Benchmark folks, what evals would you like to see added?
Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.
I've published a small taste of what the course will be like (essentially an introduction to `nbdistributed`) for free so you can understand how I'll be formatting my notes, documenting, and more for the rest of the content: maven.com/p/c4c9a9/free-…
pleased to open source this work - NoLoCo extends pipeline + data-parallel model training to heterogeneous gossip networks by modifying momentum and dynamically routing shards
Introducing NoLoCo NoLoCo trains large models over heterogeneous gossip networks, rather than high-bandwidth datacentres. It reduces synchronisation latency by 10x vs state of the art methods while converging 4% faster to the same validation loss. We're open sourcing it today.
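For intuition only, a single-process sketch of the gossip-style averaging NoLoCo builds on (the paper's actual momentum modification and shard routing are not reproduced here; pairing scheme and hyperparameters are placeholders):

```python
# Instead of a global all-reduce, each worker periodically averages
# parameters with one randomly chosen peer.
import random
import torch
import torch.nn as nn

workers = [nn.Linear(16, 1) for _ in range(8)]
opts = [torch.optim.SGD(w.parameters(), lr=1e-2, momentum=0.9) for w in workers]

def local_step(model, opt):
    loss = model(torch.randn(32, 16)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def gossip_round(workers):
    order = random.sample(range(len(workers)), len(workers))
    for i, j in zip(order[::2], order[1::2]):        # random disjoint pairs
        for p, q in zip(workers[i].parameters(), workers[j].parameters()):
            avg = (p.data + q.data) / 2              # pairwise average, O(1) peers
            p.data.copy_(avg)
            q.data.copy_(avg)

for _ in range(10):
    for w, o in zip(workers, opts):
        local_step(w, o)
    gossip_round(workers)
```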
And you made the most popular implementation ;) I've heard 75+% of the distributed learning startups at SPRIND used the INTELLECT-1 codebase!