Etash Guha
@etash_guha
Ph.D. @Stanford and @uwcse
Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)
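For anyone who wants to try the SFT-only recipe themselves, here's a minimal sketch using Hugging Face TRL. The dataset id, base model, and hyperparameters below are assumptions for illustration, not the released training configuration, and the step that formats the dataset into the trainer's expected chat schema is omitted.

```python
# Minimal SFT sketch (assumptions: TRL's SFTTrainer, HF dataset id
# "open-thoughts/OpenThoughts2-1M", a Qwen2.5 base model; hyperparameters
# are placeholders, and exact argument names vary by TRL version).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-thoughts/OpenThoughts2-1M", split="train")
# NOTE: formatting the raw records into the chat schema SFTTrainer expects
# is omitted here for brevity.

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",        # assumed base model for illustration
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="openthinker2-sft",
        num_train_epochs=3,                  # placeholder, not the released recipe
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```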

Qwen3-Coder highlights that data scaling is critical for code RL. They ran 20,000 environments in parallel; most OSS RL datasets have only 20,000 datapoints total. This kind of data may be one of the biggest gaps between what frontier labs and academic labs have.
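To make the "environments" framing concrete, here's a toy sketch of parallel reward computation for code RL: each environment executes a candidate program against its unit tests in a subprocess and returns a pass/fail reward. This is my illustration, not Qwen's infrastructure.

```python
# Toy sketch of parallel code-RL reward computation (illustrative only):
# each "environment" runs a candidate program plus its unit tests in a
# subprocess and returns a binary pass/fail reward.
import concurrent.futures as cf
import os
import subprocess
import sys
import tempfile

def run_env(candidate: str, tests: str, timeout: float = 5.0) -> float:
    """Write candidate + tests to a temp file, execute it, reward 1.0 on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

def batch_rewards(pairs: list[tuple[str, str]]) -> list[float]:
    """Run many (candidate_code, test_code) environments in parallel processes."""
    candidates, tests = zip(*pairs)
    with cf.ProcessPoolExecutor() as pool:
        return list(pool.map(run_env, candidates, tests))
```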

The set of agentic evals that Qwen/Anthropic look at is so different from the reasoning evals that OpenAI/GDM work on. I wonder whether these different bets will converge again or diverge further.
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Quick Update: I’ve officially started my PhD at Stanford (go trees i think?!?)! After an amazing year at UW (go huskies!), I’m super happy to continue my CS PhD with my amazing advisors @lschmidt3 and @YejinChoinka! If you see me on campus, please say hi and listen to me rant…
OpenThinker3-1.5B: a compact reasoning model fine-tuned from Qwen2.5-1.5B-Instruct on OpenThoughts3-1.2M, a filtered dataset of math, code, and science QA.
- +10.1 avg over R1-Distill-1.5B across math/code/science tasks
- Within 2pts of Qwen3-1.7B (closed data)
- SOTA at 1.5B…
Smaller models can reason very well!
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! 🚀 OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
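Since a 1.5B model runs comfortably on a single GPU, here's a minimal inference sketch with transformers. The Hub repository id and generation settings are assumptions for illustration; check the OpenThoughts release for the exact name.

```python
# Minimal inference sketch (assumed Hub id "open-thoughts/OpenThinker3-1.5B").
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-thoughts/OpenThinker3-1.5B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 50 odd numbers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)  # reasoning traces can be long
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```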
Flexibility and ease of evals is one of the most important drivers of ML science progress!
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating them can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
I’ll be at ICML so if anyone wants to chat about data, reasoning, agents, or the best Indian food in Vancouver, let me know! 🇨🇦
18 months in, AI isn’t eating SaaS—it’s eating the $4.6 TRILLION services budget. Our latest blog tracks the first year of Services‑as‑Software companies and distills lessons for founders
A model being good at math typically means it's also good at code and science, but being good at code doesn't strongly predict being good at science! These 1,000+ evaluations from OpenThoughts uncover neat correlations and patterns in downstream model performance! Check out…
We evaluated more than 1000 reasoning LLMs on 12 reasoning-focused benchmarks and made fascinating observations about cross-benchmark comparisons. You can explore all that data yourself on our HuggingFace spaces page. (1/4)
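As a rough illustration of the kind of cross-benchmark analysis described above, a benchmark-by-benchmark correlation over per-model scores looks like this. The scores and model names below are made up; the real data is on the HuggingFace space.

```python
# Hypothetical sketch of the cross-benchmark analysis: rows are models,
# columns are benchmarks, and we compute pairwise Pearson correlations
# between benchmarks. All numbers here are invented for illustration.
import pandas as pd

scores = pd.DataFrame({
    "AIME24":        [0.43, 0.57, 0.21, 0.66],   # math
    "LiveCodeBench": [0.38, 0.49, 0.25, 0.58],   # code
    "GPQA-Diamond":  [0.41, 0.52, 0.30, 0.60],   # science
}, index=["model_a", "model_b", "model_c", "model_d"])

corr = scores.corr(method="pearson")   # benchmark-by-benchmark correlation matrix
print(corr.round(2))
```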
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
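A toy sketch of the weak-verifier aggregation idea (my illustration, not the Weaver implementation): score each candidate answer with several verifiers, combine the scores with weights, and select the highest-scoring candidate.

```python
# Toy sketch of combining weak verifiers: each verifier scores an answer in
# [0, 1], scores are combined with weights, and the best candidate is chosen.
from typing import Callable

def select_answer(
    candidates: list[str],
    verifiers: list[Callable[[str], float]],
    weights: list[float],
) -> str:
    def combined_score(answer: str) -> float:
        return sum(w * v(answer) for v, w in zip(verifiers, weights))
    return max(candidates, key=combined_score)

# Example with two dummy verifiers: a length heuristic and a keyword check.
answers = ["42", "The answer is 42.", "probably 41"]
checks = [lambda a: min(len(a) / 20, 1.0), lambda a: 1.0 if "42" in a else 0.0]
print(select_answer(answers, checks, weights=[0.3, 0.7]))
```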