Rulin Shao
@RulinShao
PhD @UWNLP, visiting researcher @Meta.
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
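For context, a minimal sketch of how a bi-encoder retriever like ReasonIR-8B is typically used: embed the query and candidate documents, then rank by similarity. The `embed` function below is a runnable placeholder, not ReasonIR's actual API; swap in the real encoder in practice.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Placeholder encoder that returns deterministic unit vectors so the
    # sketch runs end to end; replace with the real ReasonIR-8B encoder.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

query = "Why does regularization reduce overfitting?"
docs = [
    "L2 regularization penalizes large weights, limiting capacity ...",
    "A recipe for sourdough bread with a long cold ferment ...",
    "Bias-variance tradeoff: constraining a model lowers variance ...",
]

q = embed(query)
scores = np.array([q @ embed(d) for d in docs])  # cosine sim of unit vectors
for rank, i in enumerate(np.argsort(-scores), start=1):
    print(rank, round(float(scores[i]), 3), docs[i][:48])
```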

WHY do you prefer one thing over another? Reward models treat preference as a black box😶‍🌫️ but human brains🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper🎨 PrefPalette✨ Why it matters👉🏻🧵
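A toy sketch of the general idea of attribute-decomposed preference scoring: score a response on interpretable attributes and combine them with weights, instead of producing a single black-box reward. The attribute names, heuristics, and weights here are invented for illustration and are not PrefPalette's actual ones.

```python
# Illustrative sketch of attribute-decomposed preference scoring in the
# spirit of PrefPalette. Attribute names, heuristics, and weights are
# invented for illustration; the paper's actual attributes/model differ.
ATTRIBUTE_WEIGHTS = {"helpfulness": 0.5, "formality": 0.2, "humor": 0.3}

def attribute_scores(response: str) -> dict:
    # Stand-ins for learned per-attribute scorers.
    return {
        "helpfulness": min(len(response.split()) / 50, 1.0),
        "formality": 0.0 if "!" in response else 1.0,
        "humor": 1.0 if ":)" in response else 0.0,
    }

def preference_score(response: str) -> float:
    # Overall preference = interpretable weighted sum, not a black box.
    scores = attribute_scores(response)
    return sum(w * scores[attr] for attr, w in ATTRIBUTE_WEIGHTS.items())

a = "Sure! Here's a quick fix :)"
b = ("The issue stems from an off-by-one error in the loop bound; "
     "change i <= n to i < n.")
print("A:", preference_score(a), "B:", preference_score(b))
```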
Come chat with us at our ICML poster tomorrow! 📈 Learn about the best ways to evaluate base language models during development 🧪 Find out how you can use our suite of models, which vary in pretraining distribution, for your own research 😆 Get a DataDecide sticker
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
Some updates 🚨 I finished my Ph.D. at @uwcse in June 2025! After a year at AI2 as a Research Scientist, I am joining CMU @LTIatCMU & @mldcmu (courtesy) as an Assistant Professor in Fall 2026. The journey, acknowledgments & recruiting in 🧵
It has been great working on the project with support from @allen_ai! I believe there are many meaningful ways different people and orgs can work together to build strong shared models, and data collaboration might be the most impactful form of it. 📄Paper:…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
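The thread doesn't spell out the objective, but "learn from the differences in weak data pairs" is naturally expressed as a pairwise loss. A minimal sketch using a DPO-style objective over weak chosen/rejected pairs, assuming summed token log-probs from a policy and a reference model; the paper's exact loss may differ.

```python
# Sketch of learning from *differences* between paired weak outputs, in the
# spirit of delta learning. Uses a standard DPO-style pairwise loss as an
# illustration; the paper's exact objective may differ.
import torch
import torch.nn.functional as F

def pairwise_delta_loss(logp_better, logp_worse,
                        ref_logp_better, ref_logp_worse, beta=0.1):
    # The gradient signal comes from the *gap* between the two weak
    # responses, not from their absolute quality.
    delta = (logp_better - ref_logp_better) - (logp_worse - ref_logp_worse)
    return -F.logsigmoid(beta * delta).mean()

# Toy example: summed token log-probs for a batch of 3 pairs.
logp_better = torch.tensor([-12.0, -15.0, -9.0])
logp_worse = torch.tensor([-14.0, -15.5, -13.0])
ref_better = torch.tensor([-13.0, -15.2, -10.0])
ref_worse = torch.tensor([-13.5, -15.0, -12.0])
print(pairwise_delta_loss(logp_better, logp_worse, ref_better, ref_worse))
```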
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
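A toy reading of the opt-in/out mechanism: if each expert in the mixture corresponds to one data owner's locally trained module, opting data out at inference can be sketched as masking that expert from routing. Illustrative only; see the FlexOlmo paper for the actual design.

```python
# Toy sketch of inference-time data opt-out in a mixture-of-experts layer:
# each expert stands in for one data owner's locally trained module, and
# opted-out experts are masked from routing. Not FlexOlmo's actual code.
import torch

n_experts, d = 4, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)
opted_in = torch.tensor([True, True, False, True])  # owner 2 opts out

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    logits = router(x)
    # Exclude opted-out experts from routing entirely.
    logits = logits.masked_fill(~opted_in, float("-inf"))
    weights = torch.softmax(logits, dim=-1)
    return sum(weights[:, i:i + 1] * experts[i](x) for i in range(n_experts))

print(moe_forward(torch.randn(2, d)).shape)  # torch.Size([2, 16])
```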
Happy to share that ReasonIR is accepted by @COLM_conf! Synthetic data & test-time scaling are powerful tools to enable new capabilities for challenging tasks. I’m impressed by how quickly smaller retrievers and better rerankers have been developed with ReasonIR data! #COLM2025
🚀 Last year: MassiveDS-1.4T showed great scaling gains with a web-scale datastore but was too heavy for online production ✨ Now: CompactDS is here! Better performance, compact size, ready for agentic apps & Deep Research RL training Kudos to @XinxiLyu @micdun8 for leading this!
Reasoning benchmarks (e.g., MMLU Pro and GPQA) have seen little benefit from naive RAG. But can we flip this? 🔥Introducing CompactDS: ✅Web-scale coverage ✅Runs with just 100GB RAM ✅Matches search engines The simplest RAG pipeline can even compete with agentic…
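The "simplest RAG pipeline" referenced above can be sketched as: retrieve top-k passages, prepend them to the question, generate once. `retrieve` and `generate` below are placeholders standing in for a CompactDS lookup and an LLM call.

```python
# Minimal (naive) RAG pipeline: retrieve top-k passages, prepend them to
# the prompt, and generate a single answer. Both functions are stubs.
def retrieve(query: str, k: int = 3) -> list:
    # Placeholder: in practice, an ANN search over the CompactDS index.
    corpus = [
        "Le Chatelier's principle: a system at equilibrium shifts to ...",
        "The work-energy theorem relates net work to kinetic energy ...",
        "An unrelated passage about sourdough baking ...",
    ]
    return corpus[:k]

def generate(prompt: str) -> str:
    # Placeholder for a single LLM call.
    return f"<answer conditioned on {len(prompt)} prompt chars>"

def naive_rag(question: str, k: int = 3) -> str:
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(naive_rag("What does Le Chatelier's principle predict?"))
```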
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
💡Beyond math/code, instruction following with verifiable constraints is well suited to RLVR. But existing constraint sets and verifier functions are limited, and most models overfit to IFEval. We introduce IFBench to measure model generalization to unseen constraints.
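A sketch of what a verifiable constraint and its verifier function look like; the verifier's binary pass/fail becomes the RLVR reward. The constraints below are illustrative, not IFBench's held-out set.

```python
# Sketch of verifiable instruction-following constraints + verifiers of the
# kind used in RLVR. These constraints are illustrative; IFBench's actual
# constraint set is different and deliberately held out.
def verify_word_limit(response: str, max_words: int = 50) -> bool:
    return len(response.split()) <= max_words

def verify_no_commas(response: str) -> bool:
    return "," not in response

def verify_ends_with(response: str, suffix: str = "THE END") -> bool:
    return response.rstrip().endswith(suffix)

VERIFIERS = [verify_word_limit, verify_no_commas, verify_ends_with]

def rlvr_reward(response: str) -> float:
    # Binary reward: 1.0 only if every constraint verifier passes.
    return float(all(v(response) for v in VERIFIERS))

print(rlvr_reward("Short answer with no commas. THE END"))  # 1.0
print(rlvr_reward("This one, alas, has commas. THE END"))   # 0.0
```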
Seven years ago, I co-led a paper called 𝗛𝗼𝘁𝗽𝗼𝘁𝗤𝗔 that has motivated and facilitated much #AI #Agents research since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond. In my new blog post, I revisit the brief history of…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
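A hedged sketch of what "grounded" synthetic data generation can look like: rewrite a noisy web document with an LLM while keeping the source as grounding, rather than generating from scratch. The prompt wording and the `llm` stub are assumptions, not the paper's exact recipe.

```python
# Sketch of grounded synthetic data: recycle a noisy web document by
# rewriting it with an LLM, anchored to the source text. Illustrative
# assumption of the approach, not the paper's exact pipeline.
def llm(prompt: str) -> str:
    return "<rewritten, cleaner version of the source document>"  # stub

def recycle(web_doc: str) -> str:
    prompt = (
        "Rewrite the following web page as a clear, self-contained "
        "explanatory document. Stay faithful to its facts; do not invent "
        "new claims.\n\n--- SOURCE ---\n" + web_doc
    )
    return llm(prompt)

noisy = "click here!! best guide 2 thermodynamics [ad] entropy always ..."
print(recycle(noisy))
```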
It reminds me of the cognitive behaviors that have been found to help reasoning (backtracking, subgoal setting, verification, etc.): they all seem to fit this parallel generation pattern better than linearly chaining them. Looking forward to trying it out!
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
Honored to be part of organizing the LM4Sci workshop at #COLM2025! 🔬🤖 We invite submissions that demonstrate innovative approaches to scientific reasoning and discovery. Submit by June 23! 🚀
🚨 Call for Papers: LM4Sci @COLM_conf 2025 🚨 Excited to announce the Large Language Modeling for Scientific Discovery (LM4Sci) workshop at COLM 2025 in Montreal, Canada! Submission Deadline: June 23 Notification: July 24 Workshop: October 10, 2025
🔥Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision. Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision,…
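A small demonstration of the underlying issue: low-precision floating-point addition is not associative, so a different reduction order (which varies across kernels and batch sizes) gives a slightly different result, and a tiny logit difference can flip an argmax under greedy decoding.

```python
# Why BF16 inference can be non-reproducible: addition is not associative
# in low precision, and reduction order varies across kernels/batch sizes.
import torch

torch.manual_seed(0)
x = torch.randn(1024)

def bf16_seq_sum(t: torch.Tensor) -> torch.Tensor:
    # Sequentially accumulate in bfloat16, mimicking one reduction order.
    acc = torch.zeros((), dtype=torch.bfloat16)
    for v in t:
        acc = acc + v.to(torch.bfloat16)
    return acc

print(bf16_seq_sum(x).item())          # forward order
print(bf16_seq_sum(x.flip(0)).item())  # reversed order: often differs
print("fp64 reference:", x.double().sum().item())
```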
One key takeaway from recent work on test-time compute: even a small weight update can make a big difference. So, what happens if we meta-learn those updates (and not necessarily at test time)? Excited to share this new work led by @AdamZweiger and @jyo_pari!
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
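A pseudocode-level sketch of that loop: the model proposes a self-edit (its own finetuning data), the update is applied, and the updated model's downstream score becomes the RL reward. All names below are placeholders for the control flow, not SEAL's actual code.

```python
# Skeleton of the SEAL-style loop: the model writes its own finetuning data
# (a "self-edit"), we apply the weight update, and downstream performance
# after the update is the RL reward for the self-edit policy.
class StubModel:
    """Stand-in model so the control flow runs end to end."""
    def __init__(self, skill: float = 0.0):
        self.skill = skill

    def generate(self, prompt: str) -> str:
        return f"synthetic training examples for: {prompt[:40]}"

    def updated_with(self, self_edit: str) -> "StubModel":
        return StubModel(self.skill + 0.1)  # pretend the edit helped

    def accuracy(self, eval_set) -> float:
        return min(self.skill, 1.0)

def seal_step(model, new_input, eval_set, rl_update):
    # 1) The model proposes its own training data for the new input.
    edit = model.generate(f"Write training examples that teach: {new_input}")
    # 2) Apply a small weight update using that self-edit.
    updated = model.updated_with(edit)
    # 3) Downstream performance of the *updated* model is the reward.
    reward = updated.accuracy(eval_set)
    # 4) Reinforce self-edits that lead to better post-update performance.
    rl_update(model, edit, reward)
    return updated

model = seal_step(StubModel(), "a newly observed fact", eval_set=None,
                  rl_update=lambda m, e, r: None)  # RL step stubbed out
print(model.skill)  # 0.1
```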