Rulin Shao
@RulinShao
PhD @UWNLP, visiting researcher @Meta.
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
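For context, a minimal sketch of how a bi-encoder retriever like ReasonIR-8B is typically used: embed the query and candidate documents, then rank by similarity. The `embed` function below is a runnable placeholder, not ReasonIR's actual API; swap in the real encoder in practice.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Placeholder encoder that returns deterministic unit vectors so the
    # sketch runs end to end; replace with the real ReasonIR-8B encoder.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

query = "Why does regularization reduce overfitting?"
docs = [
    "L2 regularization penalizes large weights, limiting capacity ...",
    "A recipe for sourdough bread with a long cold ferment ...",
    "Bias-variance tradeoff: constraining a model lowers variance ...",
]

q = embed(query)
scores = np.array([q @ embed(d) for d in docs])  # cosine sim of unit vectors
for rank, i in enumerate(np.argsort(-scores), start=1):
    print(rank, round(float(scores[i]), 3), docs[i][:48])
```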

WHY do you prefer one thing over another? Reward models treat preference as a black box😶‍🌫️ but human brains🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper🎨 PrefPalette✨ Why it matters👉🏻🧵
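A toy sketch of the general idea of attribute-decomposed preference scoring: score a response on interpretable attributes and combine them with weights, instead of producing a single black-box reward. The attribute names, heuristics, and weights here are invented for illustration and are not PrefPalette's actual ones.

```python
# Illustrative sketch of attribute-decomposed preference scoring in the
# spirit of PrefPalette. Attribute names, heuristics, and weights are
# invented for illustration; the paper's actual attributes/model differ.
ATTRIBUTE_WEIGHTS = {"helpfulness": 0.5, "formality": 0.2, "humor": 0.3}

def attribute_scores(response: str) -> dict:
    # Stand-ins for learned per-attribute scorers.
    return {
        "helpfulness": min(len(response.split()) / 50, 1.0),
        "formality": 0.0 if "!" in response else 1.0,
        "humor": 1.0 if ":)" in response else 0.0,
    }

def preference_score(response: str) -> float:
    # Overall preference = interpretable weighted sum, not a black box.
    scores = attribute_scores(response)
    return sum(w * scores[attr] for attr, w in ATTRIBUTE_WEIGHTS.items())

a = "Sure! Here's a quick fix :)"
b = ("The issue stems from an off-by-one error in the loop bound; "
     "change i <= n to i < n.")
print("A:", preference_score(a), "B:", preference_score(b))
```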
Come chat with us at our ICML poster tomorrow! 📈 Learn about the best ways to evaluate base language models during development 🧪 Find out how you can use our suite of models, which vary in pretraining distribution, for your own research 😆 Get a DataDecide sticker
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
Some updates 🚨 I finished my Ph.D. at @uwcse in June 2025! After a year at AI2 as a Research Scientist, I am joining CMU @LTIatCMU & @mldcmu (courtesy) as an Assistant Professor in Fall 2026. The journey, acknowledgments & recruiting in 🧵
It has been great working on the project with support from @allen_ai! I believe there are many meaningful ways different people and orgs can work together to build strong shared models, and data collaboration might be the most impactful form of it. 📄Paper:…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
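The thread doesn't spell out the objective, but "learn from the differences in weak data pairs" is naturally expressed as a pairwise loss. A minimal sketch using a DPO-style objective over weak chosen/rejected pairs, assuming summed token log-probs from a policy and a reference model; the paper's exact loss may differ.

```python
# Sketch of learning from *differences* between paired weak outputs, in the
# spirit of delta learning. Uses a standard DPO-style pairwise loss as an
# illustration; the paper's exact objective may differ.
import torch
import torch.nn.functional as F

def pairwise_delta_loss(logp_better, logp_worse,
                        ref_logp_better, ref_logp_worse, beta=0.1):
    # The gradient signal comes from the *gap* between the two weak
    # responses, not from their absolute quality.
    delta = (logp_better - ref_logp_better) - (logp_worse - ref_logp_worse)
    return -F.logsigmoid(beta * delta).mean()

# Toy example: summed token log-probs for a batch of 3 pairs.
logp_better = torch.tensor([-12.0, -15.0, -9.0])
logp_worse = torch.tensor([-14.0, -15.5, -13.0])
ref_better = torch.tensor([-13.0, -15.2, -10.0])
ref_worse = torch.tensor([-13.5, -15.0, -12.0])
print(pairwise_delta_loss(logp_better, logp_worse, ref_better, ref_worse))
```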
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
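A toy reading of the opt-in/out mechanism: if each expert in the mixture corresponds to one data owner's locally trained module, opting data out at inference can be sketched as masking that expert from routing. Illustrative only; see the FlexOlmo paper for the actual design.

```python
# Toy sketch of inference-time data opt-out in a mixture-of-experts layer:
# each expert stands in for one data owner's locally trained module, and
# opted-out experts are masked from routing. Not FlexOlmo's actual code.
import torch

n_experts, d = 4, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)
opted_in = torch.tensor([True, True, False, True])  # owner 2 opts out

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    logits = router(x)
    # Exclude opted-out experts from routing entirely.
    logits = logits.masked_fill(~opted_in, float("-inf"))
    weights = torch.softmax(logits, dim=-1)
    return sum(weights[:, i:i + 1] * experts[i](x) for i in range(n_experts))

print(moe_forward(torch.randn(2, d)).shape)  # torch.Size([2, 16])
```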
Happy to share that ReasonIR is accepted by @COLM_conf! Synthetic data & test-time scaling are powerful tools to enable new capabilities for challenging tasks. I’m impressed by how quickly smaller retrievers and better rerankers have been developed with ReasonIR data! #COLM2025
🚀 Last year: MassiveDS-1.4T showed great scaling gains with a web-scale datastore but was too heavy for online production ✨ Now: CompactDS is here! Better performance, compact size, ready for agentic apps & Deep Research RL training Kudos to @XinxiLyu @micdun8 for leading this!
Reasoning benchmarks (e.g., MMLU Pro and GPQA) have seen little benefit from naive RAG. But can we flip this? 🔥Introducing CompactDS: ✅Web-scale coverage ✅Runs with just 100GB RAM ✅Matches search engines The simplest RAG pipeline can even compete with agentic…
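The "simplest RAG pipeline" referenced above can be sketched as: retrieve top-k passages, prepend them to the question, generate once. `retrieve` and `generate` below are placeholders standing in for a CompactDS lookup and an LLM call.

```python
# Minimal (naive) RAG pipeline: retrieve top-k passages, prepend them to
# the prompt, and generate a single answer. Both functions are stubs.
def retrieve(query: str, k: int = 3) -> list:
    # Placeholder: in practice, an ANN search over the CompactDS index.
    corpus = [
        "Le Chatelier's principle: a system at equilibrium shifts to ...",
        "The work-energy theorem relates net work to kinetic energy ...",
        "An unrelated passage about sourdough baking ...",
    ]
    return corpus[:k]

def generate(prompt: str) -> str:
    # Placeholder for a single LLM call.
    return f"<answer conditioned on {len(prompt)} prompt chars>"

def naive_rag(question: str, k: int = 3) -> str:
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(naive_rag("What does Le Chatelier's principle predict?"))
```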
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
💡Beyond math/code, instruction following with verifiable constraints is well suited to RLVR. But existing constraint sets and verifier functions are limited, and most models overfit to IFEval. We introduce IFBench to measure model generalization to unseen constraints.
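A sketch of what a verifiable constraint and its verifier function look like; the verifier's binary pass/fail becomes the RLVR reward. The constraints below are illustrative, not IFBench's held-out set.

```python
# Sketch of verifiable instruction-following constraints + verifiers of the
# kind used in RLVR. These constraints are illustrative; IFBench's actual
# constraint set is different and deliberately held out.
def verify_word_limit(response: str, max_words: int = 50) -> bool:
    return len(response.split()) <= max_words

def verify_no_commas(response: str) -> bool:
    return "," not in response

def verify_ends_with(response: str, suffix: str = "THE END") -> bool:
    return response.rstrip().endswith(suffix)

VERIFIERS = [verify_word_limit, verify_no_commas, verify_ends_with]

def rlvr_reward(response: str) -> float:
    # Binary reward: 1.0 only if every constraint verifier passes.
    return float(all(v(response) for v in VERIFIERS))

print(rlvr_reward("Short answer with no commas. THE END"))  # 1.0
print(rlvr_reward("This one, alas, has commas. THE END"))   # 0.0
```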
Seven years ago, I co-led a paper called 𝗛𝗼𝘁𝗽𝗼𝘁𝗤𝗔 that has motivated and facilitated much #AI #Agents research since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond. In my new blog post, I revisit the brief history of…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
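A hedged sketch of what "grounded" synthetic data generation can look like: rewrite a noisy web document with an LLM while keeping the source as grounding, rather than generating from scratch. The prompt wording and the `llm` stub are assumptions, not the paper's exact recipe.

```python
# Sketch of grounded synthetic data: recycle a noisy web document by
# rewriting it with an LLM, anchored to the source text. Illustrative
# assumption of the approach, not the paper's exact pipeline.
def llm(prompt: str) -> str:
    return "<rewritten, cleaner version of the source document>"  # stub

def recycle(web_doc: str) -> str:
    prompt = (
        "Rewrite the following web page as a clear, self-contained "
        "explanatory document. Stay faithful to its facts; do not invent "
        "new claims.\n\n--- SOURCE ---\n" + web_doc
    )
    return llm(prompt)

noisy = "click here!! best guide 2 thermodynamics [ad] entropy always ..."
print(recycle(noisy))
```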
It reminds me of the cognitive behaviors that have been found to help reasoning (backtracking, subgoal setting, verification, etc.): they all seem to fit this parallel generation pattern better than linearly chaining them. Looking forward to trying it out!
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
Honored to be part of organizing the LM4Sci workshop at #COLM2025! 🔬🤖 We invite submissions that demonstrate innovative approaches to scientific reasoning and discovery. Submit by June 23! 🚀
🚨 Call for Papers: LM4Sci @COLM_conf 2025 🚨 Excited to announce the Large Language Modeling for Scientific Discovery (LM4Sci) workshop at COLM 2025 in Montreal, Canada! Submission Deadline: June 23 Notification: July 24 Workshop: October 10, 2025
🔥Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision. Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone. Most LLMs today default to BF16 precision,…
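A small demonstration of the underlying issue: low-precision floating-point addition is not associative, so a different reduction order (which varies across kernels and batch sizes) gives a slightly different result, and a tiny logit difference can flip an argmax under greedy decoding.

```python
# Why BF16 inference can be non-reproducible: addition is not associative
# in low precision, and reduction order varies across kernels/batch sizes.
import torch

torch.manual_seed(0)
x = torch.randn(1024)

def bf16_seq_sum(t: torch.Tensor) -> torch.Tensor:
    # Sequentially accumulate in bfloat16, mimicking one reduction order.
    acc = torch.zeros((), dtype=torch.bfloat16)
    for v in t:
        acc = acc + v.to(torch.bfloat16)
    return acc

print(bf16_seq_sum(x).item())          # forward order
print(bf16_seq_sum(x.flip(0)).item())  # reversed order: often differs
print("fp64 reference:", x.double().sum().item())
```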
One key takeaway from recent work on test-time compute: even a small weight update can make a big difference. So, what happens if we meta-learn those updates (and not necessarily at test time)? Excited to share this new work led by @AdamZweiger and @jyo_pari!
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
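A pseudocode-level sketch of that loop: the model proposes a self-edit (its own finetuning data), the update is applied, and the updated model's downstream score becomes the RL reward. All names below are placeholders for the control flow, not SEAL's actual code.

```python
# Skeleton of the SEAL-style loop: the model writes its own finetuning data
# (a "self-edit"), we apply the weight update, and downstream performance
# after the update is the RL reward for the self-edit policy.
class StubModel:
    """Stand-in model so the control flow runs end to end."""
    def __init__(self, skill: float = 0.0):
        self.skill = skill

    def generate(self, prompt: str) -> str:
        return f"synthetic training examples for: {prompt[:40]}"

    def updated_with(self, self_edit: str) -> "StubModel":
        return StubModel(self.skill + 0.1)  # pretend the edit helped

    def accuracy(self, eval_set) -> float:
        return min(self.skill, 1.0)

def seal_step(model, new_input, eval_set, rl_update):
    # 1) The model proposes its own training data for the new input.
    edit = model.generate(f"Write training examples that teach: {new_input}")
    # 2) Apply a small weight update using that self-edit.
    updated = model.updated_with(edit)
    # 3) Downstream performance of the *updated* model is the reward.
    reward = updated.accuracy(eval_set)
    # 4) Reinforce self-edits that lead to better post-update performance.
    rl_update(model, edit, reward)
    return updated

model = seal_step(StubModel(), "a newly observed fact", eval_set=None,
                  rl_update=lambda m, e, r: None)  # RL step stubbed out
print(model.skill)  # 0.1
```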