Etash Guha
@etash_guha
Ph.D. @Stanford and @uwcse
Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)
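For anyone who wants to try the SFT-only recipe themselves, here's a minimal sketch using Hugging Face TRL. The dataset id, base model, and hyperparameters below are assumptions for illustration, not the released training configuration, and the step that formats the dataset into the trainer's expected chat schema is omitted.

```python
# Minimal SFT sketch (assumptions: TRL's SFTTrainer, HF dataset id
# "open-thoughts/OpenThoughts2-1M", a Qwen2.5 base model; hyperparameters
# are placeholders, and exact argument names vary by TRL version).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-thoughts/OpenThoughts2-1M", split="train")
# NOTE: formatting the raw records into the chat schema SFTTrainer expects
# is omitted here for brevity.

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",        # assumed base model for illustration
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="openthinker2-sft",
        num_train_epochs=3,                  # placeholder, not the released recipe
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```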

Qwen3-Coder highlights that data scaling is critical for code RL. They ran 20,000 environments in parallel; most OSS RL datasets have only 20,000 datapoints total. This kind of data may be one of the biggest gaps between what frontier labs and academic labs have.
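To make the "environments" framing concrete, here's a toy sketch of parallel reward computation for code RL: each environment executes a candidate program against its unit tests in a subprocess and returns a pass/fail reward. This is my illustration, not Qwen's infrastructure.

```python
# Toy sketch of parallel code-RL reward computation (illustrative only):
# each "environment" runs a candidate program plus its unit tests in a
# subprocess and returns a binary pass/fail reward.
import concurrent.futures as cf
import os
import subprocess
import sys
import tempfile

def run_env(candidate: str, tests: str, timeout: float = 5.0) -> float:
    """Write candidate + tests to a temp file, execute it, reward 1.0 on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

def batch_rewards(pairs: list[tuple[str, str]]) -> list[float]:
    """Run many (candidate_code, test_code) environments in parallel processes."""
    candidates, tests = zip(*pairs)
    with cf.ProcessPoolExecutor() as pool:
        return list(pool.map(run_env, candidates, tests))
```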

The set of agentic evals that Qwen/Anthropic look at is so different from the reasoning evals that OpenAI/GDM work on. I wonder whether these different bets will converge again or diverge further.
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
Quick Update: I’ve officially started my PhD at Stanford (go trees i think?!?)! After an amazing year at UW (go huskies!), I’m super happy to continue my CS PhD with my amazing advisors @lschmidt3 and @YejinChoinka! If you see me on campus, please say hi and listen to me rant…
OpenThinker3-1.5B: a compact reasoning model fine-tuned from Qwen2.5-1.5B-Instruct on OpenThoughts3-1.2M, a filtered dataset of math, code, and science QA.
- +10.1 avg over R1-Distill-1.5B across math/code/science tasks
- Within 2pts of Qwen3-1.7B (closed data)
- SOTA at 1.5B…
Smaller models can reason very well!
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! 🚀 OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
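Since a 1.5B model runs comfortably on a single GPU, here's a minimal inference sketch with transformers. The Hub repository id and generation settings are assumptions for illustration; check the OpenThoughts release for the exact name.

```python
# Minimal inference sketch (assumed Hub id "open-thoughts/OpenThinker3-1.5B").
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-thoughts/OpenThinker3-1.5B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 50 odd numbers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)  # reasoning traces can be long
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```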
Flexibility and ease of evals is one of the most important drivers of ML science progress!
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating them can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
I’ll be at ICML so if anyone wants to chat about data, reasoning, agents, or the best Indian food in Vancouver, let me know! 🇨🇦
18 months in, AI isn’t eating SaaS—it’s eating the $4.6 TRILLION services budget. Our latest blog tracks the first year of Services‑as‑Software companies and distills lessons for founders
A model being good at math typically means it's also good at code and science, but being good at code doesn't strongly predict being good at science! These 1,000+ evaluations from OpenThoughts uncover neat correlations and patterns in downstream model performance! Check out…
We evaluated more than 1000 reasoning LLMs on 12 reasoning-focused benchmarks and made fascinating observations about cross-benchmark comparisons. You can explore all that data yourself on our HuggingFace spaces page. (1/4)
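As a rough illustration of the kind of cross-benchmark analysis described above, a benchmark-by-benchmark correlation over per-model scores looks like this. The scores and model names below are made up; the real data is on the HuggingFace space.

```python
# Hypothetical sketch of the cross-benchmark analysis: rows are models,
# columns are benchmarks, and we compute pairwise Pearson correlations
# between benchmarks. All numbers here are invented for illustration.
import pandas as pd

scores = pd.DataFrame({
    "AIME24":        [0.43, 0.57, 0.21, 0.66],   # math
    "LiveCodeBench": [0.38, 0.49, 0.25, 0.58],   # code
    "GPQA-Diamond":  [0.41, 0.52, 0.30, 0.60],   # science
}, index=["model_a", "model_b", "model_c", "model_d"])

corr = scores.corr(method="pearson")   # benchmark-by-benchmark correlation matrix
print(corr.round(2))
```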
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
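A toy sketch of the weak-verifier aggregation idea (my illustration, not the Weaver implementation): score each candidate answer with several verifiers, combine the scores with weights, and select the highest-scoring candidate.

```python
# Toy sketch of combining weak verifiers: each verifier scores an answer in
# [0, 1], scores are combined with weights, and the best candidate is chosen.
from typing import Callable

def select_answer(
    candidates: list[str],
    verifiers: list[Callable[[str], float]],
    weights: list[float],
) -> str:
    def combined_score(answer: str) -> float:
        return sum(w * v(answer) for v, w in zip(verifiers, weights))
    return max(candidates, key=combined_score)

# Example with two dummy verifiers: a length heuristic and a keyword check.
answers = ["42", "The answer is 42.", "probably 41"]
checks = [lambda a: min(len(a) / 20, 1.0), lambda a: 1.0 if "42" in a else 0.0]
print(select_answer(answers, checks, weights=[0.3, 0.7]))
```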