Stella Li
@StellaLisy
PhD student @uwnlp | visiting researcher @AIatMeta | undergrad @jhuclsp #NLProc
Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum could bring 19.4% gains over the default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
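Mechanically, a "random reward" just means the verifier is swapped for a coin flip while the rest of the RLVR loop stays the same. A toy sketch of the reward-and-advantage step under GRPO-style group normalization (my illustration, not the paper's code; `spurious_reward` and the rollout strings are made-up stand-ins):

```python
import random

def grpo_advantages(rewards):
    # GRPO-style group normalization: advantage = (r - mean) / std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def spurious_reward(_completion):
    # "Random reward": ignore correctness entirely and flip a coin.
    return float(random.random() < 0.5)

random.seed(0)
completions = ["sol_a", "sol_b", "sol_c", "sol_d"]  # sampled rollouts (stand-ins)
rewards = [spurious_reward(c) for c in completions]
advantages = grpo_advantages(rewards)
```

Because advantages are zero-mean within each group, even correctness-free rewards still yield a non-trivial update direction, which is part of why spurious signals can move the policy at all.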
Are audio models 🔊 really reasoning🤔 We explore a first step into audio reasoning by creating an audio benchmark ✨BLAB✨ with much longer durations (avg. 51 min)!! Such a fun project led by the amazinggggg @orevaahia 🤩🌟
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio. (avg length: 51 minutes).
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
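One way to read "learn from the *differences*": a pairwise preference objective (DPO-style) in which both responses come from weak models, so only the relative delta between the paired responses carries signal. A minimal sketch under that assumption; the arguments are log-probs under the policy and a frozen reference, and `delta_pair_loss` is my naming, not the paper's:

```python
import math

def delta_pair_loss(logp_better, logp_worse, ref_better, ref_worse, beta=0.1):
    # Only the *difference* between the paired responses enters the loss,
    # so absolute response quality can be low ("weak data") and still teach.
    margin = beta * ((logp_better - ref_better) - (logp_worse - ref_worse))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Widening the quality gap between the paired responses lowers the loss.
small_gap = delta_pair_loss(-10.0, -10.5, -10.0, -10.0)
large_gap = delta_pair_loss(-10.0, -14.0, -10.0, -10.0)
```

The point of the sketch: nothing in the objective requires either response to be strong in absolute terms, only that one is relatively better than the other.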
FlexOlmo enables fine-grained data control over language models at test time through an anchor expert. Such cool work and a great idea by @WeijiaShi2 🤩🫶
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
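A cartoon of the opt-in/out idea, assuming a mixture-of-experts where each data owner's expert can simply be masked out at inference time (hypothetical toy experts, not FlexOlmo's actual routing or architecture):

```python
def moe_forward(x, experts, opted_in):
    # Average only the experts whose data owners opted in; an opted-out
    # expert's parameters are never consulted at inference time.
    active = [f for f, ok in zip(experts, opted_in) if ok]
    return sum(f(x) for f in active) / len(active)

# Three toy "experts", each standing in for a module trained on one owner's data.
experts = [lambda x: x + 1.0, lambda x: x + 2.0, lambda x: x + 4.0]
all_in = moe_forward(1.0, experts, [True, True, True])    # everyone opted in
one_out = moe_forward(1.0, experts, [True, False, True])  # owner 2 opts out
```

The design point: because each owner's contribution lives in a separable expert, opting out is a routing mask, not a retraining job.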
🌿Introducing NaturalThoughts 🌿 arxiv.org/abs/2507.01921 🎯 Data curation for general reasoning capabilities is still relatively underexplored. - We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in…
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
💡Beyond math/code, instruction following with verifiable constraints is well-suited to RLVR. But the set of constraints and verifier functions is limited, and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.
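For concreteness, a "verifiable constraint" is just a deterministic checker over the model's output. Two illustrative checkers I made up for this sketch (not IFBench's actual verifier set):

```python
def verify_max_words(response, limit=50):
    # Constraint: "answer in at most `limit` words".
    return len(response.split()) <= limit

def verify_includes_keyword(response, keyword):
    # Constraint: "the given keyword must appear in the answer".
    return keyword.lower() in response.lower()

ok = verify_max_words("A short answer.", limit=50) and \
     verify_includes_keyword("This benchmark is hard.", "benchmark")
```

Checkers like these give a binary, gameable-but-cheap reward, which is exactly what makes constraint following a natural RLVR target, and also why a small fixed verifier set invites overfitting.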
Excited to share our latest LLM self-play research! We had LLMs challenge themselves in competitive language games, showing improvements across math, logic, and reasoning benchmarks. More evidence that online RL unlocks incredible potential! 🚀
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
🪄We made a 1B Llama BEAT GPT-4o by... making it MORE private?! LoCoMo results: 🔓GPT-4o: 80.6% 🔐1B Llama + GPT-4o (privacy): 87.7% (+7.1!⏫) 💡How? GPT-4o provides reasoning ("If X then Y"), the local model fills in the blanks with your private data to get the answer!
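The split described above can be sketched as a two-stage protocol: the cloud model reasons only over placeholders, and the local model substitutes the private values on-device. A minimal mock-up (both "models" are trivial stand-in functions; the database contents are invented for illustration):

```python
PRIVATE_DB = {"person": "Alice", "birthday": "March 3"}  # never leaves the device

def remote_reasoner(abstract_question):
    # Stand-in for the cloud model (e.g. GPT-4o in the tweet): it sees only
    # placeholders, never private values, and returns an "If X then Y" template.
    return "If {person}'s birthday is {birthday}, then the answer is {birthday}."

def local_fill(template, db):
    # Stand-in for the small on-device model: fills the blanks with private data.
    return template.format(**db)

template = remote_reasoner("When is {person}'s birthday?")
answer = local_fill(template, PRIVATE_DB)
```

The key invariant the sketch shows: the string sent to and from the remote model contains no private values, yet the final local answer does.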