Naman Jain
@StringChaos
@UCBerkeley PhD | Looking for Jobs | Projects - LiveCodeBench, DeepSWE, R2E-Gym, GSO, Syzygy, LMArena Coding | Past: @MetaAI @AWS @MSFTResearch @iitbombay
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/

DeepSWE is a new state-of-the-art open-source software engineering model trained entirely using reinforcement learning, based on Qwen3-32B. together.ai/blog/deepswe Fantastic work from @togethercompute @Agentica_‼
Announcing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. Built in…
After three intense months of hard work with the team, we made it! We hope this release can help drive the progress of Coding Agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4)
Can data owners & LM developers collaborate to build a strong shared model while each retains data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt your data in/out…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
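A rough sketch of how opt-in/out inference can work in a data-partitioned mixture-of-experts, assuming one expert per data owner and masking at the router. This is an illustration of the general idea only, not FlexOlmo's actual architecture or code:

```python
import numpy as np

def moe_forward(x, experts, router_logits, opted_in):
    """Hypothetical data-partitioned MoE forward pass.

    x: (d,) input vector
    experts: list of callables, one per data owner
    router_logits: (E,) router scores over experts
    opted_in: (E,) boolean mask chosen at inference time;
              opting a dataset out simply masks its expert.
    """
    # Drop opted-out experts before the softmax so they get zero weight.
    logits = np.where(opted_in, router_logits, -np.inf)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return sum(w * f(x) for w, f in zip(weights, experts) if w > 0)

# Toy usage: three owners, the middle one opts out at inference time.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]
out = moe_forward(np.ones(4), experts,
                  router_logits=np.array([0.1, 0.5, 0.2]),
                  opted_in=np.array([True, False, True]))
```

Because opt-out is just a routing mask, no retraining is needed when a data owner withdraws: their expert's parameters simply never contribute to the output.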
It's easy to confuse Best@K with Pass@K, and we've seen some misconceptions about our results. Our 59% on SWEBench-Verified is Pass@1 achieved via Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%. So how did we achieve this? DeepSWE generates N candidate solutions. Then, another LLM…
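To make the distinction concrete, here is a minimal sketch (the function names are mine, not DeepSWE's code): Pass@K credits a problem if any of K independent samples is correct, while Best@K lets a verifier pick one of K candidates and submits only that pick, so the resulting score is still a Pass@1 number.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn from n generations of which
    c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(candidates, verifier_score, k: int):
    """Best@k: a verifier (e.g. another LLM or an execution signal)
    scores k candidates and exactly ONE winner is submitted, so
    success is measured as Pass@1 of that single submission."""
    return max(candidates[:k], key=verifier_score)
```

Under this reading, 59% Pass@1 with Best@16 means sixteen candidates were generated per task, but only the verifier's single pick was scored.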
Is it malpractice to report SOTA with pass@8 without evaluating other models at pass@8, or is that just standard practice at this point? It's clearly not SOTA if it's behind Devstral at pass@1.
Claude is hyped to hear that its small business is getting the public recognition it deserves
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
RLVR is not just about RL, it's more about VR! Particularly for LLM coding, good verifiers (tests) are hard to get! In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter? leililab.github.io/HardTests/
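As a concrete illustration of "verifiers = tests", here is a generic test-as-reward harness (a sketch under my own assumptions, not the HardTests setup): the reward is 1.0 only if the candidate program passes the supplied tests, so weak tests translate directly into false-positive rewards.

```python
import os
import subprocess
import sys
import tempfile

def rlvr_reward(candidate_code: str, tests: str, timeout: float = 5.0) -> float:
    """Binary verifiable reward: 1.0 iff the candidate passes the tests.
    Weak or incomplete tests inflate this reward with false positives,
    which is exactly why test quality matters for RLVR."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "prog.py")
        with open(path, "w") as f:
            # Run the tests in the same file as the candidate code;
            # tests are assumed to raise/assert on failure.
            f.write(candidate_code + "\n\n" + tests)
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout)
            return 1.0 if proc.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0
```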
Questions to ask:
1. Can we see a "commit history"? (only 2 commits in repo)
2. What level of supervision was provided?
3. The paper is 4 dense pages. Was it outlined first in a Lean-friendly way, and then formalization took place?
4. Which models were used in the formalization?
We are excited to announce Trinity, an autoformalization system for verified superintelligence that we have developed at @morph_labs. We have used it to automatically formalize in Lean a classical result of de Bruijn that the abc conjecture is true almost always…
Day 3 of drilling down into popular benchmarks for models/agents. Benchmark #3: LiveCodeBench. Developed by researchers at UC Berkeley, MIT, and Cornell, this benchmark evaluates LLM code-generation skills and continually expands with new problems drawn from programming contests…
We ran this eval yesterday before the price drop 😆🫠 @OpenAI
📣 Exciting first GSO leaderboard update! @OpenAI o3 now ranks #1 setting the new SOTA at 8.8%!!
Introducing Code Researcher - a deep research agent for large systems code and commit history. aka.ms/coderesearcher Achieves a 58% crash resolution rate on a benchmark of crashes in the Linux kernel, a complex codebase with 28M LOC & 75K files.
Ensuring construct validity is becoming increasingly complex as we move towards more real-world evaluation setups. We should routinely inspect benchmark solutions to ensure the intended goal is being met!!
How do reasoning models solve hard math problems? We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found: