Ferdinand Mom
@FerdinandMom
Distributed & Decentralized training @HuggingFace
Interested in 4D parallelism but feeling overwhelmed by the Megatron-LM codebase? We are currently cooking something with @Haojun_Zhao14 and @xariusrke 😉 In the meantime, here is a self-contained script that implements Pipeline Parallelism (AFAB + 1F1B) in 200 LOC 🧵👇
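If you just want the scheduling idea before reading the script, here is a toy single-process sketch of AFAB vs 1F1B (not the thread's 200-LOC implementation; the stage and microbatch shapes are made up):

```python
# AFAB (All-Forward-All-Backward) keeps activations for every microbatch alive;
# 1F1B interleaves forwards and backwards, capping in-flight microbatches at
# roughly the number of pipeline stages.
import torch
import torch.nn as nn

stage = nn.Linear(16, 16)                       # stand-in for one pipeline stage
microbatches = [torch.randn(4, 16) for _ in range(8)]

def afab(stage, microbatches):
    # Forward every microbatch first...
    outputs = [stage(mb) for mb in microbatches]
    # ...then backward through all of them (peak memory ~ #microbatches).
    for out in outputs:
        out.sum().backward()

def one_f_one_b(stage, microbatches, num_stages=4):
    # After a short warm-up, alternate one forward with one backward,
    # keeping at most `num_stages` microbatches in flight.
    in_flight = []
    for mb in microbatches:
        in_flight.append(stage(mb))
        if len(in_flight) >= num_stages:        # steady state
            in_flight.pop(0).sum().backward()
    for out in in_flight:                       # cool-down
        out.sum().backward()

afab(stage, microbatches)
stage.zero_grad()
one_f_one_b(stage, microbatches)
```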

A GPU is bound to fail at some point, and the more GPUs you have in a cluster (scaling is all you need), the more failures you get. Meta did an amazing job improving the fault-tolerant all-reduce and improving throughput with DiLoCo (yellow vs pink).
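For context, a minimal single-process sketch of DiLoCo's outer loop as I read the paper (not Meta's code; the model, learning rates, and number of inner steps are placeholders):

```python
# Each replica takes many local AdamW steps, then the "pseudo-gradient"
# (global params minus local params) is averaged across replicas and applied
# with an outer SGD + Nesterov momentum step, so communication happens once
# per round instead of every step.
import copy
import torch
import torch.nn as nn

def diloco_round(global_model, replicas, outer_opt, inner_steps=50):
    for model in replicas:                           # one model per worker in reality
        model.load_state_dict(global_model.state_dict())
        inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for _ in range(inner_steps):                 # local steps, no communication
            x = torch.randn(8, 16)
            loss = model(x).pow(2).mean()
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

    for g, *locals_ in zip(global_model.parameters(),
                           *[m.parameters() for m in replicas]):
        # pseudo-gradient: average of (global - local), the only thing communicated
        g.grad = torch.stack([g.data - w.data for w in locals_]).mean(0)
    outer_opt.step()                                 # outer SGD + Nesterov momentum

global_model = nn.Linear(16, 1)
replicas = [copy.deepcopy(global_model) for _ in range(4)]
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)
for _ in range(3):                                   # a few communication rounds
    diloco_round(global_model, replicas, outer_opt)
```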
torchft + TorchTitan: 1200+ failures, no checkpoints, model convergence. A Llama 3 model was trained across 300 L40S GPUs with synthetic failures every 15s. No restarts. No rollbacks. Just asynchronous recovery and continued progress. 📘 hubs.la/Q03t1Z0b0 #PyTorch…
With the latest release, I want to make sure I get this message to the community: we are listening! @huggingface we are very ambitious and we want `transformers` to accelerate the ecosystem and enable all hardwares / platforms! Let's build AGI together 🫣 Unbloat and Enable!
🍷FineWeb now sits at 18.5T tokens, up 3.5T in just over a year. A few years ago, SOTA models like GPT-3 and Gopher were trained on <300B tokens, on data only big labs could access. Today, anyone can download high-quality datasets many times that size and train their own models.
Update: 🍷FineWeb and 📚 FineWeb-Edu now include English data from this year's CommonCrawl snapshots, covering Jan-Jun 2025. 🍷FineWeb now has 18.5 trillion tokens. We'll keep publishing timely updates to ensure your models have the latest world knowledge.
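If you want to poke at the data without downloading 18.5T tokens, a minimal streaming sketch (the `sample-10BT` config is one of the published FineWeb samples; field names follow the dataset card):

```python
# Stream a small FineWeb sample from the Hub and peek at a few documents.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)
for doc in fw.take(3):
    print(doc["url"], doc["text"][:100])
```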
we planned to launch our open-weight model next week. we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us. while we trust the community will build great things with this model, once weights are…
We optimized LLM inference kernels for AMD’s MI300X GPUs (192GB 😮) using ROCm/HIP — and it’s all open source. 🔧 Tuned GEMM and fused kernels 📊 Benchmarked vs other GPUs 🚀 Big perf gains 🤝 Open-sourced everything Full write-up: huggingface.co/blog/mi300kern… #LLM #AI #AMD #MI300X
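The tuned kernels themselves live in the write-up; as a hedged companion, here is a plain PyTorch harness for the kind of GEMM throughput comparison the post makes. On a ROCm build the `torch.cuda` API maps to HIP, so the same script runs on MI300X and on NVIDIA cards:

```python
# Rough GEMM throughput measurement using CUDA/HIP events.
import torch

def gemm_tflops(m, n, k, dtype=torch.bfloat16, iters=50):
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warm-up
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters
    return 2 * m * n * k / seconds / 1e12    # a GEMM is 2*m*n*k FLOPs

print(f"{gemm_tflops(8192, 8192, 8192):.1f} TFLOP/s")
```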
This blew my mind. One line of code: • Qwen2.5-0.5B training speed: 5% → 40% (MFU) • Qwen3-8B training speed: 34% → 54% (MFU) The culprit? A careless tensor transpose in the cross-entropy loss. Big thanks to @xingkaiyu for spotting it.
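I don't have the exact offending line, but the general failure mode looks like this: a transpose hands the loss a non-contiguous view of the largest tensor in the model, and the kernel pays for it with extra copies and strided softmax reads. A small sketch (shapes scaled down; real runs use bf16 and a ~151k vocab):

```python
import torch
import torch.nn.functional as F

B, S, V = 2, 1024, 32000                        # toy sizes for illustration
logits = torch.randn(B, S, V)
labels = torch.randint(V, (B, S))

# Slow path: transpose to [B, V, S]; F.cross_entropy accepts it, but the
# awkward, non-contiguous layout is far more expensive on the huge logits.
loss_slow = F.cross_entropy(logits.transpose(1, 2), labels)

# Fast path: keep the natural layout and just flatten batch/sequence dims.
loss_fast = F.cross_entropy(logits.reshape(-1, V), labels.reshape(-1))
print(loss_slow.item(), loss_fast.item())       # same value, very different cost
```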
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
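A quick-start sketch with plain `transformers` generation; the `enable_thinking` template flag is my assumption from the release notes, so check the model card for the exact think/no_think switch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why is the sky blue?"}]
# enable_thinking is forwarded to the chat template (assumed flag name).
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt",
                                 enable_thinking=False).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```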
Who do we have? @GuggerSylvain, who will introduce the initial concepts of distributed training and ZeRO. @wanchao_, creator of `torchtitan`, one of the most used pretraining frameworks out there. And Less Wright from @PyTorch will be discussing Async Tensor Parallelism, a…
PR #39120: Another day, another refactor: this time we are targeting the functionality that is not model-specific but `transformers`-specific... "only" 146 files were touched.
Distributed Training in Machine Learning🌍 Join us on July 12th as @Ar_Douillard explores key methods like FSDP, Pipeline & Expert Parallelism, plus emerging approaches like DiLoCo and SWARM—pushing the limits of global, distributed training. Learn more: tinyurl.com/9ts5bj7y
I'll discuss distributed learning on Saturday, July 12. First, I'll cover current methods that need high bandwidth, then next-generation methods for decentralized learning.
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
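My reading of rehydration, as a hedged toy sketch (the paper has the real recipe; the log-scaled repeat count and cap here are made up): documents that had many near-duplicates before deduplication get upsampled afterwards.

```python
import math

def rehydrate(docs, max_repeats=5):
    """docs: list of dicts with 'text' and 'dup_count' (copies seen before dedup)."""
    out = []
    for doc in docs:
        # Hypothetical log-scaled upsampling based on the pre-dedup duplicate count.
        repeats = min(max_repeats, 1 + int(math.log2(max(doc["dup_count"], 1))))
        out.extend([doc["text"]] * repeats)
    return out

corpus = [{"text": "rare page", "dup_count": 1},
          {"text": "widely mirrored page", "dup_count": 400}]
print(rehydrate(corpus))   # the mirrored page appears several times
```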
Evaluation was just made easier 💯 We merged a huge refactor of lighteval, making it easier to add: 🔄 Multi-turn tasks 🖼️ Multimodal tasks 📝 Plus unified logs for thorough benchmark analysis. Benchmark folks, what evals would you like to see added?
Launching SYNTHETIC-2: our next-gen open reasoning dataset and planetary-scale synthetic data generation run. Powered by our P2P inference stack and DeepSeek-R1-0528, it verifies traces for the hardest RL tasks. Contribute towards AGI via open, permissionless compute.
I've published a small taste of what the course will be like (essentially an introduction to `nbdistributed`) for free so you can understand how I'll be formatting my notes, documenting, and more for the rest of the content: maven.com/p/c4c9a9/free-…
pleased to open source this work - NoLoCo extends pipeline + data-parallel model training to heterogeneous gossip networks by modifying momentum and dynamically routing shards
Introducing NoLoCo NoLoCo trains large models over heterogeneous gossip networks, rather than high-bandwidth datacentres. It reduces synchronisation latency by 10x vs state of the art methods while converging 4% faster to the same validation loss. We're open sourcing it today.
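For intuition only, a single-process sketch of the gossip-style averaging NoLoCo builds on (the paper's actual momentum modification and shard routing are not reproduced here; pairing scheme and hyperparameters are placeholders):

```python
# Instead of a global all-reduce, each worker periodically averages
# parameters with one randomly chosen peer.
import random
import torch
import torch.nn as nn

workers = [nn.Linear(16, 1) for _ in range(8)]
opts = [torch.optim.SGD(w.parameters(), lr=1e-2, momentum=0.9) for w in workers]

def local_step(model, opt):
    loss = model(torch.randn(32, 16)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def gossip_round(workers):
    order = random.sample(range(len(workers)), len(workers))
    for i, j in zip(order[::2], order[1::2]):        # random disjoint pairs
        for p, q in zip(workers[i].parameters(), workers[j].parameters()):
            avg = (p.data + q.data) / 2              # pairwise average, O(1) peers
            p.data.copy_(avg)
            q.data.copy_(avg)

for _ in range(10):
    for w, o in zip(workers, opts):
        local_step(w, o)
    gossip_round(workers)
```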
And you made the most popular implementation ;) I've heard 75+% of the distributed learning startups at SPRIND used the INTELLECT-1 codebase!