Quentin Lhoest 🤗
@lhoestq
Datasets @huggingface | Open Source + HF Dataset Hub
Happy to share this new app: The 🤗 Infinite Dataset Hub ♾️ It's a 100% synthetic Dataset Hub, you can search any kind of dataset and always get results. The goal is to never hear "I don't have data" again from ML practitioners. Even in the most specific/custom scenario :)
We have written up our analysis: futurehouse.org/research-annou… And made a gold subset on @huggingface that passed our review: huggingface.co/datasets/futur… 6/7
🚀 We just released MultiNRC, a new benchmark for multilingual reasoning in 🇫🇷 French, 🇪🇸 Spanish, and 🇨🇳 Chinese! Frontier LLMs struggle to pass 50%. 📄 Paper: scale.com/research/multi… 📊 Leaderboard: scale.com/leaderboard/mu… #LLM #AI #NLP
How well do LLMs reason across languages? Introducing MultiNRC, our latest SEAL Leaderboard addition built to test native multilingual reasoning. ⬇️
Hey fellow researchers and hackers, I’m looking at about a petabyte of raw code for The Next Big Dataset. What is on your wish list for code data features that you wanna see? 🎅
Big news soon
IT WORKS! Finally I'm no longer GPU-poor? ._. This CLI gives instant GPUs on @huggingface to run whatever training/eval you want. And the best part is you can run as many jobs in parallel as you want x_x
Parquet Content Defined Chunking is available in PyArrow 21 :) Ideal for deduplication
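For readers wondering why content-defined chunking helps deduplication, here is a toy sketch of the general idea. It is not PyArrow's implementation: the `cdc_chunks` and `chunk_ids` helpers, the rolling hash, the mask and the size limits are all illustrative assumptions. Chunk boundaries are picked from the bytes themselves, so a small insertion only changes the chunks around it.

```python
import hashlib
import random


def cdc_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Split bytes at boundaries chosen by a toy rolling hash of the content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Toy rolling hash (assumption): older bytes shift out of the low bits.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the low hash bits are zero, or when the chunk gets too long.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data):
    """Hash every chunk so identical chunks can be stored or uploaded once."""
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}


# Hypothetical data: random bytes, then the same bytes with a small insertion.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = original[:15_000] + b"a few new bytes" + original[15_000:]

shared = chunk_ids(original) & chunk_ids(edited)
print(f"{len(shared)} of {len(chunk_ids(edited))} chunks are reusable after the edit")
```

With fixed-size blocks, every block after the insertion point would shift and stop matching. Here, every chunk before the edit is unchanged, and boundaries after it tend to realign because they depend only on nearby bytes.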

Excited to share dPrune: a Python library designed to make data selection and pruning simple and accessible for NLP and speech tasks:
Masa TikTok Datasets Now LIVE on Hugging Face 🤗 We scraped the most viral, viewed and commented TikToks of the past 2 weeks and uploaded them to @huggingface Free AI-ready TikTok transcripts + metadata to power endless use cases. Grab the Data ⬇️
Starting today you can run any of the 100K+ GGUFs on Hugging Face directly with Docker Run! All of it in one single line: docker model run hf.co/bartowski/Llam… Excited to see how y'all will use it
🥐 Summer of Workflows: Release #1 is here! Introducing Workflow #1: Croissant Ingestion A ready-to-run tool that ingests multimodal datasets directly into @ApertureData’s database from any @MLCommons Croissant URL. Try: shorturl.at/GW0W4
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
You know CASP? The competition that AlphaFold won that changed the game for AI x bio? 🧬 Just dropped all the data from their last challenge on @huggingface! check it out ⤵️ huggingface.co/datasets/cgeor…
Check this out! New target-specific dataset and model for binding affinity prediction! 📚 Paper: arxiv.org/abs/2507.08966 💻DualBind code: github.com/NVIDIA-Digital… 🤗 ToxBench dataset: huggingface.co/datasets/karll…
In collaboration with @schrodinger, we present DualBind trained on ToxBench 📚 8,770 ERα AB-FEP predicted complexes (RMSE ≈ 1 kcal mol⁻¹). At last, ML has the signal to learn true binding physics - DualBind already hits r = 0.84 without shortcuts. How could this reshape…
🚀 Exciting News for #AI in #Software Engineering! @vayavya created a C language benchmark for real-world Software Engineering tasks & evaluated popular code generation frameworks, including #windsurf, MSWE-agent, and #claudecode 🐛✨
EsportsBench has been refreshed with data up through June 2025: over 61k new matches across 20 esports have been recorded in the last 3 months! huggingface.co/datasets/Espor…
This combined with Parquet CDC will make the Datasets on @huggingface faster to download and upload than S3
We've moved the first 20PB from Git LFS to Xet on @huggingface without any interruptions, now we're migrating the rest of the Hub. We got this far by focusing on the community first. Here's a deep dive on the infra making this possible and what's next: huggingface.co/blog/migrating…
🚀Our rStar-Coder dataset is now released! A verified dataset of 418K competition-level code problems, each with test cases of varying difficulty. On LiveCodeBench, it boosts Qwen2.5-14B from 23.3% → 62.5%, beating o3-mini (low) by +3.1%. Try it here: huggingface.co/datasets/micro…