Clémentine Fourrier 🍊

@clefourrier

Evals @HuggingFace 🐍✨ "The future is already here, it’s just not very evenly distributed" (Gibson) Not an AGI believer, LLMs are good at form not substance

Joined October 2019

396 Following

5K Followers

Pinned

Clémentine Fourrier 🍊@clefourrier · Oct 10

Dear LLM twitter, I made an evaluation guidebook for you! 🥳 github.com/huggingface/ev… Goal: sharing both practical insights and theoretical knowledge the @huggingface evaluation team gathered while managing the Open LLM Leaderboard and designing lighteval! To know more: mini🧵

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval! - huggingface/evaluation-guidebook

118

610

581

107.0K

Clémentine Fourrier 🍊 Retweeted

Georgia Channing@cgeorgiaw · Jul 21

data of the day: just dropped a big snapshot of polar elevation data on @huggingface. 1000s of TIFFs and metadata to 32m resolution, perfect for climate research, mapping, and geospatial modeling. check it out: huggingface.co/datasets/cgeor… if people like this data, maybe i'll make a…

601

Clémentine Fourrier 🍊 Retweeted

Michael R. Bock@michaelrbock · Jul 23

1/ Can AI file your taxes? Not yet. We tested the latest frontier models and the results were full of catastrophic errors. Letting AI do your taxes would mean IRS rejections, audits, and penalties:

240

134

67.0K

Clémentine Fourrier 🍊 Retweeted

paul rudd@philsadelphia · Jul 24

telling my grandchildren this was chatgpt

101

7.0K

41.0K

815

1.1M

Clémentine Fourrier 🍊 Retweeted

Andrew White 🐦‍⬛@andrewwhite01 · Jul 23

HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7

584

170

116.0K

Clémentine Fourrier 🍊@clefourrier · Jul 23

I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM. Guide in comments:

Qwen@Alibaba_Qwen · Jul 22

>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…

537

623

69.0K

Clémentine Fourrier 🍊@clefourrier · Jul 22

Anyone have a connection at @Alibaba_Qwen? Trying to reproduce the results on @arcprize and getting different metrics. Want to get a hold of them and find out how they tested.

Greg Kamradt@GregKamradt · Jul 21

.@arcprize listed on the @Alibaba_Qwen model card. 2nd model card for us in 2 weeks. Excited for ARC-AGI to be seen as a supported way to measure model performance x.com/Alibaba_Qwen/s…

10.0K

Clémentine Fourrier 🍊@clefourrier · Jul 23

signatures to look for in ai writing -
> "it isn't just x, it is y"
> narrative-philosophical-poetic section headings "The XYZ - A Journey of ABC"
> overuse of symbolism and lofty adjectives - "stands as a testament", "plays a vital role", "underscores its importance"
>…

Ben (no treats)@andersonbcdefg · Jul 23

once you see it you can't unsee it. and it's everywhere

198

159

20.0K

Clémentine Fourrier 🍊 Retweeted

Loubna Ben Allal@LoubnaBenAllal1 · Jul 21

500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese. To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and…

175

10.0K

Clémentine Fourrier 🍊 Retweeted

Georgia Channing@cgeorgiaw · Jul 21

very proud that my work on multi-agent debate for misinformation detection won best paper award at the @icmlconf CFAgentic workshop!
check it out on arxiv: arxiv.org/abs/2410.20140
v grateful to all my co-authors and the support from @BBCRD 🥳

647

Clémentine Fourrier 🍊 Retweeted

ARC Prize@arcprize · Jul 18

ARC-AGI-3 Preview games need to be pressure tested. We’re hosting a 30-day agent competition in partnership with @huggingface. We’re calling on the community to build agents (and win money!) arcprize.org/competitions/a…

108

24.0K

Clémentine Fourrier 🍊 Retweeted

ARC Prize@arcprize · Jul 18

Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI.
We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores - Frontier AI: 0%, Humans: 100%

225

2.0K

549

330.0K

Clémentine Fourrier 🍊 Retweeted

Jason Kint@jason_kint · Jul 20

This A1 story on water use by Meta’s data centers, and others, in an age of AI is incredible. /1

779

2.0K

550

1.4M

Clémentine Fourrier 🍊 Retweeted

Lewis Tunstall@_lewtun · Jul 20

An underappreciated fact about using formal methods like Lean is that it enables large-scale *collaboration* among mathematicians & potentially future AI agents. Why? Well, you can decompose a large proof into separate components that can be proven independently with robust…

3.0K

Clémentine Fourrier 🍊 Retweeted

paris martineau@parismartineau · Jul 20

stop 👏 anthropomorphizing 👏 the 👏 LLM 👏

599

11.0K

271

188.0K

Clémentine Fourrier 🍊 Retweeted

Mikhail Samin@Mihonarium · Jul 20

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony. According to a Coordinator on Problem 6, the one problem OpenAI…

197

2.0K

508

455.0K

Clémentine Fourrier 🍊 Retweeted

steven@Tu7uruu · Jul 18

Just dropped on the Open ASR Leaderboard: Canary-Qwen-2.5, the latest and first-of-its-kind ASR model from the NVIDIA NeMo team.
> Ranked #1 on the Open ASR Leaderboard with a WER of just 5.63%
> Blazing fast with RTFx=418 on an A100 GPU for a 2.5b model!
> Released under a…

133

5.0K

Clémentine Fourrier 🍊 Retweeted

Clémentine Fourrier 🍊@clefourrier · Jul 17

Can LLMs predict the future? In FutureBench, friends from @togethercompute create new questions from evolving news & markets: As time passes, we'll see which agents are the best at predicting events that have yet to happen! 🔮 Also cool: by design, dynamic & uncontaminated eval

4.0K

Clémentine Fourrier 🍊 Retweeted

Together AI@togethercompute · Jul 17

Most AI benchmarks test the past. But real intelligence is about predicting the future. Introducing FutureBench — a new benchmark for evaluating agents on real forecasting tasks that we developed with @huggingface 🔍 Reasoning > memorization 📊 Real-world events 🧠 Dynamic,…

25.0K