Guilherme Penedo
@gui_penedo
Pre-training data @huggingface 🤗. Lisboner 🇵🇹
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
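To make the idea concrete (this is a rough illustration, not the paper's exact recipe): rehydration keeps one copy of each duplicate cluster after deduplication, then upsamples that surviving copy based on how many duplicates it originally had, on the intuition that widely duplicated documents tend to be useful. A minimal Python sketch; the cap and the linear repeat weighting are assumptions for illustration.

```python
import random

def rehydrate(docs, max_repeats=5, seed=0):
    """Toy deduplication-based upsampling ("rehydration") sketch.

    `docs` is a list of dicts with a `text` field and a `dup_count` field
    recording how many near-duplicates the document had before dedup.
    Each surviving document is repeated according to its duplicate count,
    capped so that extremely common boilerplate is not over-sampled.
    The cap and the linear weighting are illustrative assumptions, not
    the exact recipe from the FineWeb2 paper.
    """
    rng = random.Random(seed)
    upsampled = []
    for doc in docs:
        repeats = min(doc.get("dup_count", 1), max_repeats)
        upsampled.extend([doc] * repeats)
    rng.shuffle(upsampled)
    return upsampled

# Example: a page seen 3 times on the web is kept once by dedup,
# then repeated 3 times in the training mix; heavy boilerplate is capped.
corpus = [
    {"text": "a widely mirrored tutorial", "dup_count": 3},
    {"text": "a page seen only once", "dup_count": 1},
    {"text": "boilerplate copied thousands of times", "dup_count": 4000},
]
print([d["text"] for d in rehydrate(corpus)])
```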

I don't think we need an American DeepSeek Project, we need an Open-Data DeepSeek. And no, we didn't get one yet, despite what you might think, so let me explain. The biggest contributor to the gap between closed-source and open-source AI is, in my opinion, data accessibility and…
FineWeb2 🥂 has been accepted to @COLM_conf! See you in October 🇨🇦
SmolLM3 just dropped! It's the first large-scale LLM pre-trained on FW-2, which makes it fully fluent in 5+ new languages beyond English 🌍. By making the model both smol and multilingual, we're taking a real step toward ensuring more people can access and benefit from LLMs.
Introducing SmolLM3: a strong, smol reasoner!
> SoTA 3B model
> dual mode reasoning (think/no_think)
> long context, up to 128k
> multilingual: en, fr, es, de, it, pt
> fully open source (data, code, recipes)
huggingface.co/blog/smollm3
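For readers who want to try the dual reasoning mode, here is a minimal sketch with transformers. The think/no_think toggle comes from the post above; the checkpoint name and the exact system-prompt flag syntax are assumptions, check the model card before relying on them.

```python
# Sketch of toggling SmolLM3's think / no_think modes via the system prompt.
# Model ID and flag syntax are assumptions based on the release notes;
# consult the model card for the canonical usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(user_msg, thinking=True):
    # Express the think/no_think toggle as a system-prompt flag.
    system = "/think" if thinking else "/no_think"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

print(chat("Explain rehydration in one sentence.", thinking=False))
```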
465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanstri…
Finally had a bit of time to jot down some thoughts on this solid, open data engineering work from @essential_ai. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly…
[1/5] 🚀 Meet Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases!
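The "rich metadata" in the thread above is the curation hook: each document carries taxonomy labels you can filter on without reprocessing raw text. A minimal sketch with the datasets library; the dataset path, column names, and threshold here are hypothetical placeholders, the real schema is on the Essential-Web v1.0 dataset card.

```python
# Sketch of metadata-driven curation over a web-scale corpus.
# Dataset path and column names are hypothetical placeholders.
from datasets import load_dataset

ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)

def is_medical_and_clean(doc):
    # Keep documents whose (assumed) taxonomy label marks them as medical
    # content and whose (assumed) quality score clears a threshold.
    return doc.get("domain_label") == "medical" and doc.get("quality_score", 0.0) > 0.5

medical_subset = ds.filter(is_medical_and_clean)
for doc in medical_subset.take(3):
    print(doc["text"][:200])
```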
Very happy to have played a (very small) part in the release of this very large fully open dataset. We finally have an answer to the question: "how good a model can we get with fully permissible data?" Turns out, not bad at all.
Can you train performant language models without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.