Ofir Press
@OfirPress
I build tough benchmarks for LMs and then I get the LMs to solve them. Postdoc @Princeton. SWE-bench & SWE-agent. PhD w/ @nlpnoah @UW.
How we build SWE-bench & SWE-agent at Princeton and an overview of our recent and upcoming work: youtube.com/watch?v=yAQw77…

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
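For a sense of what a "100 lines, 0 special tools" agent can look like, here is a minimal sketch (not the actual mini code, and much shorter) of a bash-only agent loop; `query_lm`, the `DONE` stop token, and the prompt wording are all hypothetical stand-ins for whatever LM client and conventions you use:

```python
import subprocess

def query_lm(messages):
    """Hypothetical stand-in for an LM API call; returns the next bash command as text."""
    raise NotImplementedError("plug in your LM client here")

def run_agent(task, max_steps=30):
    # The entire "scaffold": ask the LM for a bash command, run it,
    # feed the output back as the next observation, stop when it says DONE.
    messages = [{"role": "user",
                 "content": f"Solve this task by issuing one bash command per turn. "
                            f"Reply DONE when finished.\n{task}"}]
    for _ in range(max_steps):
        command = query_lm(messages)
        if command.strip() == "DONE":
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=120)
        observation = (result.stdout + result.stderr)[-4000:]  # truncate long output
        messages += [{"role": "assistant", "content": command},
                     {"role": "user", "content": observation}]
    return messages
```

Because the loop only shells out to bash and exchanges plain text with the model, it works with any LM and is easy to hack on for benchmarking, fine-tuning, or RL setups.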
Congrats @huybery and team on the great numbers on SWE-bench Verified and SWE-bench Multilingual!!
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
SWE-agent EnIGMA is at ICML, go talk to Talor and Minghao about agents for cybersecurity! x.com/abramovichtalo…
We’re @icmlconf Hall B2-B3, Poster W-101 right now to talk about our agent for discovering new cybersec vulns. Come chat with us!
Excited to share our new Community Alignment project! We introduce 1) a new alignment dataset, 2) a way of measuring the social/cultural values of LM responses, and 3) models tuned to these values. A quick 🧵
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
The SWE-bench / SWE-agent team agrees with @polynoamial on this, and we'll have two small launches on Thursday to push this line of thought forward. Can you guess what they'll be? x.com/latentspacepod…
Noam Brown from OpenAI just dropped a truth bomb: "Your fancy AI scaffolds will be washed away by scale" Routers, harnesses, complex agentic systems... all getting replaced by models that just work better out of the box The reasoning models already proved this
These Verified numbers are similar to the ones Claude 3.7 achieved at the end of February 2025. Seems like the open-closed source gap is less than half a year right now.
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️