Quentin Lhoest 🤗

@lhoestq

Datasets @huggingface | Open Source + HF Dataset Hub

Joined September 2013

290 Following

4K Followers

Pinned

Quentin Lhoest 🤗@lhoestq · Jul 23, 2024

Happy to share this new app: The 🤗 Infinite Dataset Hub ♾️ It's a 100% synthetic Dataset Hub, you can search any kind of dataset and always get results. The goal is to never hear "I don't have data" again from ML practitioners. Even in the most specific/custom scenario :)


Quentin Lhoest 🤗 Retweeted

Andrew White 🐦‍⬛@andrewwhite01 · Jul 23

We have written up our analysis: futurehouse.org/research-annou… And made a gold subset on @huggingface that passed our review: huggingface.co/datasets/futur… 6/7


Quentin Lhoest 🤗@lhoestq · Jul 23

🚀 We just released MultiNRC, a new benchmark for multilingual reasoning in 🇫🇷 French, 🇪🇸 Spanish, and 🇨🇳 Chinese! Frontier LLMs struggle to pass 50%. 📄 Paper: scale.com/research/multi… 📊 Leaderboard: scale.com/leaderboard/mu… #LLM #AI #NLP

Scale AI@scale_AI · Jul 23

How well do LLMs reason across languages? Introducing MultiNRC, our latest SEAL Leaderboard addition built to test native multilingual reasoning. ⬇️


Quentin Lhoest 🤗 Retweeted

Anton Lozhkov@anton_lozhkov · Jul 22

Hey fellow researchers and hackers, I’m looking at about a petabyte of raw code for The Next Big Dataset. What is on your wish list for code data features that you wanna see? 🎅


Quentin Lhoest 🤗@lhoestq · Jul 23

Big news soon

Quentin Lhoest 🤗@lhoestq · Apr 14

IT WORKS! Finally I'm no longer GPU-poor? ._. This CLI gives instant GPUs on @huggingface to run whatever training/eval you want. And the best part is you can run as many jobs in parallel as you want x_x


Quentin Lhoest 🤗@lhoestq · Jul 22

Parquet Content Defined Chunking is available in PyArrow 21 :) Ideal for deduplication
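
To illustrate why content-defined chunking helps deduplication, here is a toy sketch in plain Python. It is not PyArrow's implementation and does not touch Parquet at all; the gear-style rolling hash, the `cdc_chunks` helper, and the `boundary_bits` parameter are all assumptions made for the example. The idea it shows: chunk boundaries are chosen from the bytes themselves, so inserting data at the front of a file shifts the boundaries along with the content and most chunks stay byte-identical.

```python
import random
from typing import List

# Hypothetical gear table: one fixed pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, boundary_bits: int = 6) -> List[bytes]:
    """Split data where the low bits of a rolling hash are all zero.

    The low `boundary_bits` bits of the hash depend only on the last
    `boundary_bits` bytes, so a boundary is decided by local content,
    not by the absolute offset in the file.
    """
    boundary_mask = (1 << boundary_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear hash update: older bytes are shifted out of the low bits.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & boundary_mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


# Made-up payload: 20,000 pseudo-random bytes, then the same bytes with a prefix.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
edited = b"new rows" + original

original_chunks = cdc_chunks(original)
edited_chunks = cdc_chunks(edited)

# Count how many original chunks reappear unchanged in the edited version.
edited_set = set(edited_chunks)
reused = sum(chunk in edited_set for chunk in original_chunks)
print(f"{reused} of {len(original_chunks)} chunks reused after the edit")
```

With fixed-size chunks, the same 8-byte prefix would shift every later chunk and nothing could be reused. Real implementations typically also enforce minimum and maximum chunk sizes, which this sketch leaves out for simplicity.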


Quentin Lhoest 🤗 Retweeted

Abdul@abdulthought · Jul 20

Excited to share dPrune: a Python library designed to make data selection and pruning simple and accessible for NLP and speech tasks:


Quentin Lhoest 🤗 Retweeted

Masa@getmasafi · Jul 18

Masa TikTok Datasets Now LIVE on Hugging Face 🤗 We scraped the most viral, viewed and commented TikToks of the past 2 weeks and uploaded them to @huggingface. Free AI-ready TikTok transcripts + metadata to power endless use cases. Grab the Data ⬇️


Quentin Lhoest 🤗 Retweeted

Adrien Carreira@XciD_ · Jul 17

Starting today you can run any of the 100K+ GGUFs on Hugging Face directly with Docker Run! All of it in one single line: docker model run hf.co/bartowski/Llam… Excited to see how y'all will use it


Quentin Lhoest 🤗 Retweeted

ApertureData@ApertureData · Jul 9

🥐 Summer of Workflows: Release #1 is here! Introducing Workflow #1: Croissant Ingestion. A ready-to-run tool that ingests multimodal datasets directly into @ApertureData’s database from any @MLCommons Croissant URL. Try: shorturl.at/GW0W4


Quentin Lhoest 🤗 Retweeted

smitha milli@SmithaMilli · Jul 16

Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵


Quentin Lhoest 🤗 Retweeted

Georgia Channing@cgeorgiaw · Jul 16

You know CASP? The competition that AlphaFold won that changed the game for AI x bio? 🧬 Just dropped all the data from their last challenge on @huggingface! check it out ⤵️ huggingface.co/datasets/cgeor…


Quentin Lhoest 🤗@lhoestq · Jul 15

Check this out! New target-specific dataset and model for binding affinity prediction! 📚 Paper: arxiv.org/abs/2507.08966 💻DualBind code: github.com/NVIDIA-Digital… 🤗 ToxBench dataset: huggingface.co/datasets/karll…

NVIDIA Healthcare@NVIDIAHealth · Jul 15

In collaboration with @schrodinger, we present DualBind trained on ToxBench 📚 8,770 ERα AB-FEP predicted complexes (RMSE ≈ 1 kcal mol⁻¹). At last, ML has the signal to learn true binding physics - DualBind already hits r = 0.84 without shortcuts. How could this reshape…


Quentin Lhoest 🤗 Retweeted

Vayavya Labs@vayavya · Jul 16

🚀 Exciting News for #AI in #Software Engineering! @vayavya created a C language benchmark for real-world Software Engineering tasks & evaluated popular code generation frameworks, including #windsurf, MSWE-agent, and #claudecode 🐛✨


Quentin Lhoest 🤗 Retweeted

Clayton Thorrez@cthorrez · Jul 16

EsportsBench refreshed with data up through June 2025, over 61k new matches across 20 esports have been recorded in the last 3 months! huggingface.co/datasets/Espor…


Quentin Lhoest 🤗 Retweeted

Nous Research@NousResearch · Jul 15

huggingface.co/datasets/NousR…


Quentin Lhoest 🤗@lhoestq · Jul 15

This combined with Parquet CDC will make the Datasets on @huggingface faster to download and upload than S3

Jared Sulzdorf@j_sulz · Jul 15

We've moved the first 20PB from Git LFS to Xet on @huggingface without any interruptions, now we're migrating the rest of the Hub. We got this far by focusing on the community first. Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrating…


Quentin Lhoest 🤗 Retweeted

Li Lyna Zhang@LynaZhang · Jul 15

🚀Our rStar-Coder dataset is now released! A verified dataset of 418K competition-level code problems, each with test cases of varying difficulty. On LiveCodeBench, it boosts Qwen2.5-14B from 23.3% → 62.5%, beating o3-mini (low) by +3.1%. Try it here: huggingface.co/datasets/micro…
