Sean Welleck
@wellecks
Assistant Professor at CMU. Marathoner, @thesisreview.
Another AI system, ByteDance's SeedProver solved 4 out of 6 IMO problems *with* Lean, and solved a fifth with extended compute. This is becoming routine, like when we went to the moon for the fourth time. There is *nothing* "routine" about this!!...
😂 @wellecks , i think this “challenging problem” may have been finally solved after five years. === Understanding and creating mathematics using natural mathematical language … used by humans is a challenging and important problem for driving progress in machine learning. ===
AlphaVerus – today at ICML!
Can LLMs self-improve on code generation? Check out our work AlphaVerus where model generates provably correct code and self-improves without any weight updates! At #ICML2025 today: 📆: 11:00 AM - 1:30 PM 📷: Poster #East-2912 alphaverus.github.io w/ Bryan, @wellecks
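Schematically, "self-improvement without weight updates" can be read as a loop like the minimal sketch below: candidates that pass a formal verifier are kept and fed back as in-context exemplars for later rounds. The `generate` and `verify` interfaces are hypothetical placeholders, not the actual AlphaVerus pipeline.

```python
def self_improve(task_specs, generate, verify, rounds=3):
    """Illustrative inference-time self-improvement loop (no weight updates):
    keep only candidates a verifier accepts, and reuse them as in-context
    exemplars when generating in later rounds.

    generate(spec, exemplars) -> candidate program text (e.g., from an LLM)
    verify(spec, program)     -> True iff the verifier proves it correct
    Both interfaces are hypothetical, not AlphaVerus's actual API.
    """
    exemplars = []
    for _ in range(rounds):
        for spec in task_specs:
            candidate = generate(spec, exemplars)
            if verify(spec, candidate):  # e.g., a Verus-style proof check
                exemplars.append((spec, candidate))
    return exemplars
```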
Huge congratulations to Vaishnavh, Chen and Charles on the outstanding paper award 🎉 We will be presenting our #ICML2025 work on creativity in the Oral 3A Reasoning session (West Exhibition Hall C) 10 - 11 am PT. Or please stop by our poster right after @ East Exhibition…
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
Some updates 🚨 I finished my Ph.D. at @uwcse in June 2025! After a year at AI2 as a Research Scientist, I am joining CMU @LTIatCMU & @mldcmu (courtesy) as an Assistant Professor in Fall 2026. The journey, acknowledgments & recruiting in 🧵
I will be at #ICML2025 this week. Reach out if you want to chat about llm reasoning, computer-use agents, code gen or actually anything! (DMs are open) I will also be presenting AlphaVerus (self-improving verified code gen) this Thursday! alphaverus.github.io
L1 is heading to COLM! We've released 5 new open L1 models and the Massive-Math dataset to celebrate:
Super excited to see L1 accepted to #COLM2025! We are further open-sourcing 5 new models & a dataset: 1. L1-7B & L1-8B: Exact and Max variants 2. L1-1.5B-Short: Short reasoning model (SRM), RL-trained on 1.2M data points 3. Massive-Math-455K: A clean, unified math dataset 🧵
🚨 Deadline for SCALR 2025 Workshop: Test‑time Scaling & Reasoning Models at COLM '25 @COLM_conf is approaching!🚨 scalr-workshop.github.io 🧩 Call for short papers (4 pages, non‑archival) now open on OpenReview! Submit by June 23, 2025; notifications out July 24. Topics…
Really nice work based on inference scaling laws that account for memory accesses. Very insightful!
🥳 Happy to share our new work – Kinetics: Rethinking Test-Time Scaling Laws 🤔How to effectively build a powerful reasoning agent? Existing compute-optimal scaling laws suggest 64K thinking tokens + 1.7B model > 32B model. But, It only shows half of the picture! 🚨 The O(N²)…
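As a rough, back-of-the-envelope illustration of why memory accesses matter in this regime (the model configs below are assumed for illustration, not taken from the paper): every newly generated token re-reads the whole KV cache, so a small model "thinking" for 64K tokens can end up with more attention memory traffic than a much larger model answering in a few thousand tokens.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed, illustrative configs (not the paper's numbers):
small = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, seq_len=64_000)
large = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=8_000)
print(f"small model, 64K thinking tokens: {small / 1e9:.1f} GB of KV cache")
print(f"large model,  8K thinking tokens: {large / 1e9:.1f} GB of KV cache")
```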
[LG] Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening A He, D Fried, S Welleck [CMU] (2025) arxiv.org/abs/2506.02355
In the test time scaling era, we all would love a higher throughput serving engine! Introducing Tokasaurus, a LLM inference engine for high-throughput workloads with large and small models! Led by @jordanjuravsky, in collaboration with @HazyResearch and an amazing team!
Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with @achakravarthy01, @ryansehrlich, @EyubogluSabri, @brad19brown, @jshetaye,…
There's lots of RL goodies in the tech report behind @FutureHouseSF's new reasoning model for chemistry 👀 Three things stood out to me: 1. Training domain-specific experts in parallel, before distilling into a generalist model. The clever thing here is that you can parallelise…
Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
Simple yet cool idea. I find it interesting how the community now cares more about pass@k than pass@1 eval, which dominated the field over the last 5-6 months
New paper by Andre He: Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening arxiv.org/abs/2506.02355 Tired of sharpening the distribution? Try unlikeliness reward to learn new things from the roads less traveled
I believe the next big test for LLMs is whether they can generate truly novel ideas in open-ended situations. We translate notions of "creativity" from cogsci into simple tasks that reveal how far today’s models fall, and how multi-token training + randomness might help.
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
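A minimal sketch of what "seed-conditioning" could look like at inference time, under assumed interfaces (the `model.generate` call is hypothetical and the paper's training-time setup is omitted): randomness is injected as a random prefix in the input, and decoding is then greedy, instead of relying on output-side temperature sampling.

```python
import random

def seed_conditioned_generate(model, prompt_ids, seed_len=8, vocab_size=32000):
    """Illustrative 'seed-conditioning' sketch: prepend a random seed sequence,
    then decode greedily, so diversity comes from the input rather than
    output sampling. `model.generate` is a hypothetical interface.
    """
    seed_ids = [random.randrange(vocab_size) for _ in range(seed_len)]
    return model.generate(seed_ids + prompt_ids, do_sample=False)
```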
‘Bold,’ ‘positive’ and ‘unparalleled’: Allen School Ph.D. graduates Ashish Sharma and Sewon Min recognized with ACM Doctoral Dissertation Awards news.cs.washington.edu/2025/06/04/all… Massive congrats to @sharma_ashish_2 and @sewon__min - huge win for @uwnlp and the broader NLP community! 🙌
Unlikeliness reward dramatically changes how GRPO uplifts low probability vs. high probability sequences, leading to improved pass@N for high N. It also improves sample diversity, e.g. measured by unique proofs generated.
We found that GRPO suffers from what we call a rank bias: reinforcing high probability correct outputs, but not low probability correct outputs (left plot) However, we argue that increasing low-probability correct outputs is important for improving pass@N (right plot)
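A toy sketch of the general idea (the specific bonus below is an illustrative assumption, not the paper's exact reward): correct outputs that the current policy ranks as unlikely get an extra reward before GRPO's group normalization, so the update also reinforces them instead of only sharpening already-likely ones.

```python
import numpy as np

def grpo_advantages_with_unlikeliness(rewards, seq_logprobs, bonus=0.5):
    """Group-normalized advantages with an illustrative unlikeliness bonus.

    rewards:      0/1 correctness rewards for a group of sampled outputs
    seq_logprobs: total log-probabilities of those outputs under the policy
    bonus:        weight of the assumed rank-based unlikeliness term
    """
    rewards = np.asarray(rewards, dtype=float)
    seq_logprobs = np.asarray(seq_logprobs, dtype=float)

    # Rank outputs from most to least likely within the group (0 = most likely);
    # correct outputs the policy currently finds unlikely get an extra reward.
    ranks = np.argsort(np.argsort(-seq_logprobs))
    unlikeliness = ranks / max(len(rewards) - 1, 1)
    shaped = rewards + bonus * rewards * unlikeliness

    # Standard GRPO-style normalization of shaped rewards within the group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-6)
```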