Leandro von Werra
@lvwerra
Head of research @huggingface
The Ultra-Scale Playbook: Training LLMs on GPU Clusters. Learn how to train your own DeepSeek-V3 model using 5D parallelism, ZeRO, fast kernels, and compute/comm overlap, with theory on bottlenecks, interactive plots, 4000+ scaling experiments, and audio! huggingface.co/spaces/nanotro…
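Of the techniques the playbook covers, ZeRO is the easiest to try at home: PyTorch ships a stage-1 variant as ZeroRedundancyOptimizer. A minimal sketch, assuming a torchrun launch with NCCL; the model and sizes are illustrative, not taken from the playbook:

```python
# ZeRO-1 style optimizer-state sharding with PyTorch's built-in
# ZeroRedundancyOptimizer. Launch with: torchrun --nproc_per_node=N script.py
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(rank), device_ids=[rank])

    # Each rank keeps only its shard of the AdamW moments (ZeRO stage 1),
    # cutting optimizer-state memory roughly by the world size.
    opt = ZeroRedundancyOptimizer(
        model.parameters(), optimizer_class=torch.optim.AdamW, lr=1e-4
    )

    x = torch.randn(8, 4096, device=rank)
    loss = model(x).square().mean()
    loss.backward()   # DDP overlaps the gradient all-reduce with backward
    opt.step()        # each rank updates its shard, then params are synced
    opt.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```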

SmolLM3-3B-8da4w: With #TorchAO & optimum-executorch, quantizing and exporting for mobile is a breeze. Now ready for on-device deployment with #ExecuTorch, running at 15 tokens/sec on Galaxy S22. 🔗 Model card with recipes + checkpoints: hubs.la/Q03yGyTN0 #EdgeAI #PyTorch
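For readers wondering what "8da4w" means in code: int8 dynamic activations with grouped int4 weights. A rough sketch with TorchAO — the exact import path changes across torchao versions (newer releases use a Config-object API), so treat these names and the group size as assumptions:

```python
# "8da4w" = int8 dynamically quantized activations + int4 grouped weights.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.float32
)

# Quantize the linear layers in place: per-group int4 weights with
# int8 dynamic activation quantization at runtime.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```

The ExecuTorch export itself goes through optimum-executorch; the linked model card has the exact recipe and checkpoints.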
I'm in DC and the hype lives up. This Action Plan has everything: open source, evals, science of and with AI. The vibe shift on open source is momentous. I remember when. Congrats @sriramk @DavidSacks @mkratsios47 @deanwball
🇺🇸 Today is a day we have been working towards for six months. We are announcing America’s AI action plan putting us on the road to continued AI dominance. The three core themes: - Accelerate AI innovation - Build American AI infrastructure - Lead in international AI…
Many VLMs claim to process hours of video. But can they follow the story?🤔 Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳
We've just released 100+ intermediate checkpoints and our training logs from the SmolLM3-3B training. We hope this can be useful to researchers working on mechanistic interpretability, training dynamics, RL and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
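If you want to poke at the intermediate checkpoints, Hub revisions are the usual mechanism. A sketch where the repo id and revision name are hypothetical placeholders, not the confirmed SmolLM3 layout:

```python
# Load one intermediate checkpoint by Hub revision.
# Repo id and revision naming below are assumptions for illustration.
from transformers import AutoModelForCausalLM

ckpt = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-checkpoints",  # assumed repo id
    revision="step-100000",                  # assumed revision name
)
```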
Beyond happy to announce that I'm joining 🤗 @huggingface as a #MachineLearningEngineer focused on #WebML!
You know CASP? The competition that AlphaFold won that changed the game for AI x bio? 🧬 Just dropped all the data from their last challenge on @huggingface! check it out ⤵️ huggingface.co/datasets/cgeor…
smol and not-so-smol models are trending together in perfect harmony

Personally, I really prefer casual online posts and discussions like this from engineers and researchers over super condensed papers. It's much more pleasant to read and understand the reasoning behind the decisions. Great job @Kimi_Moonshot team! zhihu.com/question/19271…

Excited to share SmolTalk2: the secret sauce behind SmolLM3's post-training and dual reasoning capabilities. Ideal if you want to train models with think/no_think capabilities!
Introducing SmolTalk2: the dataset behind SmolLM3's dual reasoning. - mid-training → 5M samples - SFT data → 3M samples - preferences for APO → 500k samples It combines open datasets with new ones curated for strong think and no_think performance. hf.co/datasets/Huggi…
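A quick way to inspect the splits — the repo id and config name here are assumptions based on the truncated link above, not verified against the Hub:

```python
# Peek at the SmolTalk2 splits (mid-training / SFT / preferences).
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk2", "SFT")  # assumed repo/config ids
print(ds)
```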
Will be in Paris next week to visit the Hugging Face HQ. Anybody around for coffee?
cute compute…❤️❤️❤️
Thrilled to finally share what we've been working on for months at @huggingface 🤝@pollenrobotics Our first robot: Reachy Mini A dream come true: cute and low priced, hackable yet easy to use, powered by open-source and the infinite community. Tiny price, small size, huge…
SmolLM3 has day-zero support in mlx-lm and it's blazing fast on an M4 Max:
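For reference, a minimal mlx-lm sketch on Apple silicon; the model id is assumed from the release:

```python
# Run SmolLM3 locally with mlx-lm (pip install mlx-lm).
from mlx_lm import load, generate

model, tokenizer = load("HuggingFaceTB/SmolLM3-3B")  # assumed model id
text = generate(
    model, tokenizer,
    prompt="Explain KV caching in one sentence.",
    max_tokens=128,
    verbose=True,  # prints tokens/sec as it generates
)
```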
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
We’re releasing the top 3B model out there, with SOTA performance. It has dual-mode reasoning (with or without think), extended long context up to 128k, and it’s multilingual with strong support for en, fr, es, de, it, pt. What more do you need? Oh yes, we’re also open-sourcing all…
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open; we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >…
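Intra-document masking just means tokens in a packed sequence can't attend across document boundaries. A toy sketch of the idea, not the SmolLM3 training code:

```python
# Block-diagonal causal masking for packed sequences: each token attends
# only to earlier tokens of the *same* document.
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) document index of each packed token.
    Returns a (seq_len, seq_len) bool mask; True = attention allowed."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Two packed documents of lengths 3 and 2:
mask = intra_document_causal_mask(torch.tensor([0, 0, 0, 1, 1]))
print(mask.int())  # block-diagonal lower-triangular pattern
```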
SmolLM3 is out - a smol, long context, multilingual reasoner! Along with the model we release the full engineering blueprint for building SoTA LLMs at that scale, so everybody can build on it! Check it out: hf.co/blog/smollm3
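To toggle the dual reasoning mode in practice, the chat template reportedly accepts a system-prompt flag; a hedged sketch, where the "/no_think" flag is an assumption about the template rather than verified behavior:

```python
# Switch SmolLM3 between think and no_think modes via the system prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "system", "content": "/no_think"},  # assumed flag; "/think" enables reasoning
    {"role": "user", "content": "What is 17 * 24?"},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:]))
```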

It’s easy to fine-tune small models w/ RL to outperform foundation models on vertical tasks. We’re open sourcing Osmosis-Apply-1.7B: a small model that merges code (similar to Cursor’s instant apply) better than foundation models. Links to download and try out the model below!