Yixin Ye
@BLeavesYe
Undergrad @sjtu1896. Intern @ GAIR Lab (http://plms.ai). Visiting @stanfordnlp. NLP/LLMs/Reasoning. Looking for a Ph.D. position starting fall 2026.
🤔 How many examples does an LLM need to learn competition-level math?
Conventional wisdom: 100,000+ examples
Our discovery: Just 817 carefully chosen ones 🤩
With pure SFT, LIMO achieves:
57.1% on AIME
94.8% on MATH
LIMO: Less is More for Reasoning 📝
🔗 arxiv.org/pdf/2502.03387

🎉 Excited to announce that LIMO has been accepted by COLM2025 @COLM_conf ! We'll soon release an updated paper with a detailed data construction process and a new version of the dataset - smaller in size but with better performance. Stay tuned!
What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”: (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?…
Third-party evaluations highlight LIMO's strong generalization capabilities on AIME 2025!!!
I spent some time evaluating the frontier math models on AIME24 and AIME25 to see how they generalize. An interesting trend I found is that SFT on minimal data can also generalize quite well if you pick the right data. See LIMO-32B. Training with RL does not necessarily lead…
Thrilled to see the LIMO dataset making such an immediate impact! A 10-point boost on AIME24 and GPQA, with a 3-point improvement on MATH-500, is truly exciting. We welcome more researchers to explore and experiment. Together we can push the boundaries of efficient mathematical reasoning…
I'm running a shit-ton of GRPO experiments on DeepSeek's distilled models with the LIMO dataset and it really works well 🔥! Depending on the hyperparameters, I'm able to get ~10 point boost on AIME24 and GPQA, with ~3 point boost on MATH-500 (likely saturated). Link with more…
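For readers unfamiliar with the GRPO mentioned above: its core idea is to sample a group of responses per prompt and standardize each response's reward against the group's mean and standard deviation, replacing a learned value function. A minimal sketch of that advantage computation (the function name and the 0/1 correctness reward are illustrative, not from the experiments above):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's sampled group.

    Each response's advantage is its reward standardized against the
    group mean and std, so no critic/value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers to one math problem,
# reward 1.0 if the final answer is correct, else 0.0
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers in the group get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the correct completions.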