Simon Shaolei Du
@SimonShaoleiDu
Assistant Professor @uwcse. Postdoc @the_IAS. PhD in machine learning @mldcmu.
Can transformers analyze code efficiently? ✅ Yes. We prove transformers efficiently handle real compiler tasks (AST construction, symbol resolution, type inference) using only logarithmic size, while RNNs require size linear in the input length. Paper: arxiv.org/abs/2410.14706 #COLM2025
🚨 Code is live! Check out LoRe – a modular, lightweight codebase for personalized reward modeling from user preferences. 📦 Few-shot personalization 📊 Benchmarks: TLDR, PRISM, PersonalLLM 👉 github.com/facebookresear… Huge thanks to @AIatMeta for open-sourcing this research 🙌
🧠 Your LLM should model how you think, not reduce you to preassigned traits 📢 Introducing LoRe: a low-rank reward modeling framework for personalized RLHF ❌ Demographic grouping/handcrafted traits ✅ Infers implicit preferences ✅ Few-shot adaptation 📄 arxiv.org/abs/2504.14439
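For the curious, here is a minimal numpy sketch of the low-rank idea as I read it (not the released LoRe code): each user's reward is a user-specific weighted combination of a small set of shared basis reward functions, so personalizing to a new user only means fitting a K-dimensional weight vector from a handful of labeled comparisons. The feature vectors, basis heads, and the toy user below are all made up for illustration.

```python
# Hypothetical sketch of low-rank personalized reward modeling (not the official LoRe code).
# Assumption: each response is summarized by a feature vector phi of dimension d, and K
# shared basis reward heads have already been trained on pooled preference data.
import numpy as np

rng = np.random.default_rng(0)
d, K = 16, 4                       # feature dim, number of shared basis rewards
W_basis = rng.normal(size=(K, d))  # shared basis reward heads (pretrained in practice)

def basis_rewards(phi):
    """K basis reward scores for one response feature vector."""
    return W_basis @ phi           # shape (K,)

def user_reward(phi, user_w):
    """Personalized reward = low-rank combination of basis rewards."""
    return user_w @ basis_rewards(phi)

def fit_user_weights(pref_pairs, lr=0.1, steps=200):
    """Few-shot adaptation: fit only a K-dim weight vector from (chosen, rejected) pairs
    with a Bradley-Terry (logistic) preference loss."""
    w = np.zeros(K)
    for _ in range(steps):
        grad = np.zeros(K)
        for phi_chosen, phi_rejected in pref_pairs:
            diff = basis_rewards(phi_chosen) - basis_rewards(phi_rejected)  # (K,)
            p = 1.0 / (1.0 + np.exp(-(w @ diff)))    # P(chosen preferred over rejected)
            grad += (p - 1.0) * diff                 # gradient of -log p w.r.t. w
        w -= lr * grad / len(pref_pairs)
    return w

# Toy usage: a "user" who likes basis reward 0 and dislikes basis reward 1.
true_w = np.array([1.0, -1.0, 0.0, 0.0])
pairs = []
for _ in range(8):                                   # only 8 labeled comparisons
    a, b = rng.normal(size=d), rng.normal(size=d)
    chosen, rejected = (a, b) if true_w @ basis_rewards(a) > true_w @ basis_rewards(b) else (b, a)
    pairs.append((chosen, rejected))
w_hat = fit_user_weights(pairs)
print("recovered user weights (up to scale):", np.round(w_hat, 2))
```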
I'll present StoryEval tomorrow at CVPR, happy to catch up with new and old friends! 📍 ExHall D, Poster #284 ⌚ 10:30 am - 12:30 pm on 6.14
Can the current best T2V generative models (Veo2, Kling, Sora, Gen-3, Pika, Hailuo, ...) completely present short stories like “How to Put an Elephant in a Refrigerator”? 🐘 Not yet! Simple stories containing multiple sequential events, such as “opens the refrigerator door” 🚪,…
Excited to share our work led by @ypwang61: RLVR with only ONE training example can boost accuracy on MATH500 by 37%.
We only need ONE example for RLVR on LLMs to achieve significant improvement on math tasks! 📍RLVR with one training example can boost: - Qwen2.5-Math-1.5B: 36.0% → 73.6% - Qwen2.5-Math-7B: 51.0% → 79.2% on MATH500. 📄 Paper: arxiv.org/abs/2504.20571…
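A toy sketch of what 1-shot RLVR means mechanically, under my own simplifying assumptions (this is not the paper's training code): repeatedly sample a group of answers to the single training question, score each with a binary verifiable reward, and take a policy-gradient step against the group-mean baseline. A 4-way softmax stands in for the LLM; the candidate answers are hypothetical.

```python
# Toy illustration of RLVR with a single training example (my sketch, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
candidates = ["10", "12", "14", "16"]   # hypothetical candidate answers for the ONE question
ground_truth = "14"
logits = np.zeros(len(candidates))      # "policy" parameters

def policy_probs():
    p = np.exp(logits - logits.max())
    return p / p.sum()

def verifiable_reward(answer):
    """Binary verifiable reward: 1 iff the answer matches the known ground truth."""
    return 1.0 if answer == ground_truth else 0.0

lr, group_size = 0.5, 8
for step in range(50):
    probs = policy_probs()
    idxs = rng.choice(len(candidates), size=group_size, p=probs)   # rollouts for the same prompt
    rewards = np.array([verifiable_reward(candidates[i]) for i in idxs])
    baseline = rewards.mean()                                      # group-mean baseline
    for i, r in zip(idxs, rewards):
        advantage = r - baseline
        grad_logpi = np.eye(len(candidates))[i] - policy_probs()   # d log pi(a_i) / d logits
        logits += lr * advantage * grad_logpi / group_size         # policy-gradient ascent

print("P(correct answer) after training:", round(float(policy_probs()[candidates.index(ground_truth)]), 3))
```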
1/6 Current AI agent training methods fail to capture diverse behaviors needed for human-AI cooperation. GOAT (Generative Online Adversarial Training) uses online adversarial training to explore a pre-trained generative model's latent space to generate realistic yet challenging…
Check out our new work using online multi-agent RL for LM safety.
🤔Conventional LM safety alignment is reactive: find vulnerabilities→patch→repeat 🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵
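Roughly, the training loop has the shape sketched below (my reading of the setup, not the authors' code): an attacker policy and a defender policy are updated online with opposing rewards from a judge, so each side keeps adapting to the other instead of being patched once. Here both "LMs" are tiny categorical policies over a few styles and the judge is a one-line rule, purely to show the shape of the co-evolution loop.

```python
# Structural sketch of online attacker/defender self-play (toy stand-in, not the real setup).
import numpy as np

rng = np.random.default_rng(0)
n_styles = 4
att_logits = np.zeros(n_styles)   # attacker policy over "attack styles"
def_logits = np.zeros(n_styles)   # defender policy over "defense styles"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(logits, action, reward, lr=0.2):
    """One online policy-gradient (REINFORCE) update on a categorical policy."""
    grad_logpi = np.eye(len(logits))[action] - softmax(logits)
    return logits + lr * reward * grad_logpi

for step in range(2000):                             # both agents co-evolve online
    attack = rng.choice(n_styles, p=softmax(att_logits))
    defense = rng.choice(n_styles, p=softmax(def_logits))
    unsafe = 1.0 if attack != defense else 0.0       # toy judge: uncovered attack succeeds
    att_logits = reinforce_step(att_logits, attack, unsafe)         # attacker seeks failures
    def_logits = reinforce_step(def_logits, defense, 1.0 - unsafe)  # defender patches them

print("attacker mix:", softmax(att_logits).round(2))
print("defender mix:", softmax(def_logits).round(2))
```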
Oral @icmlconf !!! Can't wait to share our work and hear the community's thoughts on it, should be a fun talk! Can't thank my collaborators enough: @cogscikid @liangyanchenggg @SimonShaoleiDu @maxhkw @natashajaques
Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity. Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks. shorturl.at/fqsNN 🧵
Congratulations to @UW #UWAllen Ph.D. grads @sharma_ashish_2 & @sewon__min, @TheOfficialACM Doctoral Dissertation Award honorees! Sharma won for #AI tools for mental health; Min received honorable mention for efficient, flexible language models. #ThisIsUW news.cs.washington.edu/2025/06/04/all…
PPO vs. DPO? 🤔 Our new paper proves that it depends on whether your models can represent the optimal policy and/or reward. Paper: arxiv.org/abs/2505.19770 Led by @smellycat_ZZZ @MinhakSong
Two-stage RLHF or one-stage DPO: Which one is better for learning from preferences? Equal under strong assumptions, but representation differences break the tie. Our paper reveals their fine-grained performance gaps under various conditions. paper: arxiv.org/abs/2505.19770
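As a quick reminder of the two objects being compared (an illustrative sketch, not the paper's code): two-stage RLHF first fits a Bradley-Terry reward model on preference pairs and then maximizes it under a KL penalty (e.g., with PPO), while DPO folds both stages into a single loss on the policy's and reference policy's log-probabilities. The toy numbers below are made up.

```python
# Side-by-side sketch of the two objectives (illustrative only).
import torch
import torch.nn.functional as F

beta = 0.1  # KL / temperature coefficient shared by both formulations

# --- Stage 1 of RLHF: Bradley-Terry reward-model loss on a (chosen, rejected) pair ---
def reward_model_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected)

# --- Stage 2 of RLHF (objective only): maximize E[r(x, y)] - beta * KL(pi || pi_ref), e.g. with PPO ---

# --- DPO: same preference data, but the implicit reward is beta * log(pi / pi_ref) ---
def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    implicit_r_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_r_rejected = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(implicit_r_chosen - implicit_r_rejected)

# Toy numbers: the policy already prefers the chosen response slightly more than the reference does.
print(dpo_loss(torch.tensor(-5.0), torch.tensor(-7.0),
               torch.tensor(-5.5), torch.tensor(-6.5)))
```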
Our new paper tries to uncover what we really need when applying RLVR.
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
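Concretely, "spurious rewards" just swaps the reward signal attached to each rollout while leaving the RLVR training loop untouched. A sketch of the three variants named in the post (my illustration, not the released code); they would plug into the same policy-gradient loop as in the one-example sketch above.

```python
# The three reward variants compared above; the training loop itself is unchanged.
import random

def ground_truth_reward(answer, gold):
    return 1.0 if answer == gold else 0.0           # standard RLVR reward

def incorrect_reward(answer, gold):
    return 1.0 - ground_truth_reward(answer, gold)  # rewards only wrong answers

def random_reward(answer, gold, p=0.5):
    return 1.0 if random.random() < p else 0.0      # ignores the answer entirely

rollout = {"answer": "14", "gold": "14"}
for name, fn in [("ground-truth", ground_truth_reward),
                 ("incorrect", incorrect_reward),
                 ("random", random_reward)]:
    print(name, fn(rollout["answer"], rollout["gold"]))
```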
Even with the same vision encoder, generative VLMs (LLaVA) can extract more information than CLIP. Why? Check out our #ACL2025NLP paper led by @SitingLi627: arxiv.org/pdf/2411.05195
Excited to share that our paper "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" is accepted to #ACL2025! Preprint: arxiv.org/pdf/2411.05195 Thanks so much to @SimonShaoleiDu and @PangWeiKoh for your support and guidance throughout the journey!
Famous LLM researcher Bruce Lee quote: "I fear not the LLM who has practiced 10,000 questions once, but I fear the LLM who has practiced one question 10,000 times."
So excited to announce our work was accepted as a Spotlight paper to @icmlconf !!! I'm looking forward to presenting our work there this summer and @cogsci_soc! Big thank you again to my collaborators @cogscikid @liangyanchenggg @SimonShaoleiDu @maxhkw @natashajaques
The sampler is crucial for faster convergence of online DPO! Check out our paper: arxiv.org/abs/2409.19605 #ICLR2025
Previous works study the sample complexity of DPO and emphasize the role of samplers in online DPO. But what about their role in optimization convergence rates? Check out our paper at #ICLR2025 on the convergence rates of online DPO with various samplers! ArXiv: arxiv.org/pdf/2409.19605.
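To see where the sampler enters, here is a toy online DPO loop on a 4-armed bandit (my sketch; the sampler names below are generic placeholders, not the paper's exact schemes): each iteration draws a response pair from the current policy according to some sampler, labels the pair with a preference oracle, and takes one DPO gradient step.

```python
# Toy online DPO on a bandit with 4 "responses"; swap the sampler to compare behaviors.
import torch
import torch.nn.functional as F

def uniform_sampler(policy_probs):
    """Sample two responses i.i.d. from the current policy."""
    dist = torch.distributions.Categorical(policy_probs)
    return dist.sample().item(), dist.sample().item()

def greedy_vs_sample_sampler(policy_probs):
    """Alternative choice: pair the current greedy response with a sampled one."""
    greedy = int(torch.argmax(policy_probs))
    other = torch.distributions.Categorical(policy_probs).sample().item()
    return greedy, other

def dpo_step(logits, ref_logits, sampler, true_reward, beta=0.1, lr=0.5):
    probs = torch.softmax(logits, dim=-1)
    i, j = sampler(probs.detach())
    if i == j:                                   # need a distinct pair to form a preference
        return logits
    chosen, rejected = (i, j) if true_reward[i] >= true_reward[j] else (j, i)
    logp = torch.log_softmax(logits, dim=-1)
    ref_logp = torch.log_softmax(ref_logits, dim=-1)
    margin = beta * ((logp[chosen] - ref_logp[chosen]) - (logp[rejected] - ref_logp[rejected]))
    loss = -F.logsigmoid(margin)
    grad, = torch.autograd.grad(loss, logits)
    return (logits - lr * grad).detach().requires_grad_(True)

true_reward = torch.tensor([0.0, 0.2, 0.9, 0.1])  # arm 2 is the best "response"
ref_logits = torch.zeros(4)
logits = torch.zeros(4, requires_grad=True)
for _ in range(500):                              # try greedy_vs_sample_sampler here to compare
    logits = dpo_step(logits, ref_logits, uniform_sampler, true_reward)
print("policy after online DPO (uniform sampler):", torch.softmax(logits, -1).round(decimals=2))
```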
Excited to share our new work led by @kjha02: scaling training to more diverse environments is key to human-AI cooperation!