Clémentine Fourrier 🍊
@clefourrier
Evals @HuggingFace 🐍✨ "The future is already here, it’s just not very evenly distributed" (Gibson) Not an AGI believer, LLMs are good at form not substance
Dear LLM twitter, I made an evaluation guidebook for you! 🥳 github.com/huggingface/ev… Goal: sharing both practical insights and theoretical knowledge the @huggingface evaluation team gathered while managing the Open LLM Leaderboard and designing lighteval! To know more: mini🧵
data of the day: just dropped a big snapshot of polar elevation data on @huggingface. 1000s of TIFFs and metadata to 32m resolution. perfect for climate research, mapping, and geospatial modeling. check it out: huggingface.co/datasets/cgeor… if people like this data, maybe i'll make a…
1/ Can AI file your taxes? Not yet. We tested the latest frontier models and the results were full of catastrophic errors. Letting AI do your taxes would mean IRS rejections, audits, and penalties:
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM. Guide in comments:
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Anyone have a connection at @Alibaba_Qwen? Trying to reproduce the results on @arcprize and getting different metrics. Want to get a hold of them and find out how they tested.
.@arcprize listed on the @Alibaba_Qwen model card. 2nd model card for us in 2 weeks. Excited for ARC-AGI to be seen as a supported way to measure model performance x.com/Alibaba_Qwen/s…
signatures to look for in ai writing - > "it isn't just x, it is y" > narrative-philosophical-poetic section headings "The XYZ - A Journey of ABC" > overuse of symbolism and lofty adjectives - "stands as a testament", "plays a vital role", "underscores its importance" >…
once you see it you can't unsee it. and it's everywhere
500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese. To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and…
very proud that my work on multi-agent debate for misinformation detection won best paper award at the @icmlconf CFAgentic workshop! check it out on arxiv: arxiv.org/abs/2410.20140 v grateful to all my co-authors and the support from @BBCRD 🥳
ARC-AGI-3 Preview games need to be pressure tested. We’re hosting a 30-day agent competition in partnership with @huggingface. We’re calling on the community to build agents (and win money!) arcprize.org/competitions/a…
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI. We’re releasing: * 3 games (environments) * $10K agent contest * AI agents API. Starting scores - Frontier AI: 0%, Humans: 100%
This A1 story on the water use of Meta’s data centers, and others, in an age of AI is incredible. /1
An underappreciated fact about using formal methods like Lean is that it enables large-scale *collaboration* among mathematicians & potentially future AI agents. Why? Well, you can decompose a large proof into separate components that can be proven independently with robust…
🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony. According to a Coordinator on Problem 6, the one problem OpenAI…
Just dropped on the Open ASR Leaderboard: Canary-Qwen-2.5, the latest and first-of-its-kind ASR model from the NVIDIA NeMo team. > Ranked #1 on the Open ASR Leaderboard with a WER of just 5.63% > Blazing fast with RTFx=418 on an A100 GPU for a 2.5b model! > Released under a…
Can LLMs predict the future? In FutureBench, friends from @togethercompute create new questions from evolving news & markets. As time passes, we'll see which agents are the best at predicting events that have yet to happen! 🔮 Also cool: by design, dynamic & uncontaminated eval
Most AI benchmarks test the past. But real intelligence is about predicting the future. Introducing FutureBench — a new benchmark for evaluating agents on real forecasting tasks that we developed with @huggingface 🔍 Reasoning > memorization 📊 Real-world events 🧠 Dynamic,…