Stella Li
@StellaLisy
PhD student @uwnlp | visiting researcher @AIatMeta | undergrad @jhuclsp #NLProc
Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum could bring 19.4% gains over the default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
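Mechanically, a "random reward" just means the verifier is swapped for a coin flip while the rest of the RLVR loop stays the same. A toy sketch of the reward-and-advantage step under GRPO-style group normalization (my illustration, not the paper's code; `spurious_reward` and the rollout strings are made-up stand-ins):

```python
import random

def grpo_advantages(rewards):
    # GRPO-style group normalization: advantage = (r - mean) / std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def spurious_reward(_completion):
    # "Random reward": ignore correctness entirely and flip a coin.
    return float(random.random() < 0.5)

random.seed(0)
completions = ["sol_a", "sol_b", "sol_c", "sol_d"]  # sampled rollouts (stand-ins)
rewards = [spurious_reward(c) for c in completions]
advantages = grpo_advantages(rewards)
```

Because advantages are zero-mean within each group, even correctness-free rewards still yield a non-trivial update direction, which is part of why spurious signals can move the policy at all.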
Are audio models 🔊 really reasoning🤔 We explore a first step into audio reasoning by creating an audio benchmark ✨BLAB✨ with much longer durations (avg. 51 min)!! Such a fun project led by the amazinggggg @orevaahia 🤩🌟
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio. (avg length: 51 minutes).
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
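One way to read "learn from the *differences*": a pairwise preference objective (DPO-style) in which both responses come from weak models, so only the relative delta between the paired responses carries signal. A minimal sketch under that assumption; the arguments are log-probs under the policy and a frozen reference, and `delta_pair_loss` is my naming, not the paper's:

```python
import math

def delta_pair_loss(logp_better, logp_worse, ref_better, ref_worse, beta=0.1):
    # Only the *difference* between the paired responses enters the loss,
    # so absolute response quality can be low ("weak data") and still teach.
    margin = beta * ((logp_better - ref_better) - (logp_worse - ref_worse))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Widening the quality gap between the paired responses lowers the loss.
small_gap = delta_pair_loss(-10.0, -10.5, -10.0, -10.0)
large_gap = delta_pair_loss(-10.0, -14.0, -10.0, -10.0)
```

The point of the sketch: nothing in the objective requires either response to be strong in absolute terms, only that one is relatively better than the other.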
FlexOlmo enables fine-grained data control over language models at test time through an anchor expert. Such cool work and a great idea by @WeijiaShi2 🤩🫶
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
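A cartoon of the opt-in/out idea, assuming a mixture-of-experts where each data owner's expert can simply be masked out at inference time (hypothetical toy experts, not FlexOlmo's actual routing or architecture):

```python
def moe_forward(x, experts, opted_in):
    # Average only the experts whose data owners opted in; an opted-out
    # expert's parameters are never consulted at inference time.
    active = [f for f, ok in zip(experts, opted_in) if ok]
    return sum(f(x) for f in active) / len(active)

# Three toy "experts", each standing in for a module trained on one owner's data.
experts = [lambda x: x + 1.0, lambda x: x + 2.0, lambda x: x + 4.0]
all_in = moe_forward(1.0, experts, [True, True, True])    # everyone opted in
one_out = moe_forward(1.0, experts, [True, False, True])  # owner 2 opts out
```

The design point: because each owner's contribution lives in a separable expert, opting out is a routing mask, not a retraining job.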
🌿Introducing NaturalThoughts 🌿 arxiv.org/abs/2507.01921 🎯 Data curation for general reasoning capabilities is still relatively underexplored. - We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in…
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
💡Beyond math/code, instruction following with verifiable constraints is well-suited to RLVR. But the set of constraints and verifier functions is limited, and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.
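For concreteness, a "verifiable constraint" is just a deterministic checker over the model's output. Two illustrative checkers I made up for this sketch (not IFBench's actual verifier set):

```python
def verify_max_words(response, limit=50):
    # Constraint: "answer in at most `limit` words".
    return len(response.split()) <= limit

def verify_includes_keyword(response, keyword):
    # Constraint: "the given keyword must appear in the answer".
    return keyword.lower() in response.lower()

ok = verify_max_words("A short answer.", limit=50) and \
     verify_includes_keyword("This benchmark is hard.", "benchmark")
```

Checkers like these give a binary, gameable-but-cheap reward, which is exactly what makes constraint following a natural RLVR target, and also why a small fixed verifier set invites overfitting.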
Excited to share our latest LLM self-play research! We had LLMs challenge themselves in competitive language games, showing improvements across math, logic, and reasoning benchmarks. More evidence that online RL unlocks incredible potential! 🚀
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
🪄We made a 1B Llama BEAT GPT-4o by... making it MORE private?! LoCoMo results: 🔓GPT-4o: 80.6% 🔐1B Llama + GPT-4o (privacy): 87.7% (+7.1!⏫) 💡How? GPT-4o provides reasoning ("If X then Y"), the local model fills in the blanks with your private data to get the answer!
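The split described above can be sketched as a two-stage protocol: the cloud model reasons only over placeholders, and the local model substitutes the private values on-device. A minimal mock-up (both "models" are trivial stand-in functions; the database contents are invented for illustration):

```python
PRIVATE_DB = {"person": "Alice", "birthday": "March 3"}  # never leaves the device

def remote_reasoner(abstract_question):
    # Stand-in for the cloud model (e.g. GPT-4o in the tweet): it sees only
    # placeholders, never private values, and returns an "If X then Y" template.
    return "If {person}'s birthday is {birthday}, then the answer is {birthday}."

def local_fill(template, db):
    # Stand-in for the small on-device model: fills the blanks with private data.
    return template.format(**db)

template = remote_reasoner("When is {person}'s birthday?")
answer = local_fill(template, PRIVATE_DB)
```

The key invariant the sketch shows: the string sent to and from the remote model contains no private values, yet the final local answer does.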