Ofir Press
@OfirPress
I build tough benchmarks for LMs and then I get the LMs to solve them. Postdoc @Princeton. SWE-bench & SWE-agent. PhD w/ @nlpnoah @UW.
How we build SWE-bench & SWE-agent at Princeton and an overview of our recent and upcoming work: youtube.com/watch?v=yAQw77…

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
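For a sense of what a "100 lines, 0 special tools" agent can look like, here is a minimal sketch (not the actual mini code, and much shorter) of a bash-only agent loop; `query_lm`, the `DONE` stop token, and the prompt wording are all hypothetical stand-ins for whatever LM client and conventions you use:

```python
import subprocess

def query_lm(messages):
    """Hypothetical stand-in for an LM API call; returns the next bash command as text."""
    raise NotImplementedError("plug in your LM client here")

def run_agent(task, max_steps=30):
    # The entire "scaffold": ask the LM for a bash command, run it,
    # feed the output back as the next observation, stop when it says DONE.
    messages = [{"role": "user",
                 "content": f"Solve this task by issuing one bash command per turn. "
                            f"Reply DONE when finished.\n{task}"}]
    for _ in range(max_steps):
        command = query_lm(messages)
        if command.strip() == "DONE":
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=120)
        observation = (result.stdout + result.stderr)[-4000:]  # truncate long output
        messages += [{"role": "assistant", "content": command},
                     {"role": "user", "content": observation}]
    return messages
```

Because the loop only shells out to bash and exchanges plain text with the model, it works with any LM and is easy to hack on for benchmarking, fine-tuning, or RL setups.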
Congrats @huybery and team on the great numbers on SWE-bench Verified and SWE-bench Multilingual!!
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
SWE-agent EnIGMA is at ICML, go talk to Talor and Minghao about agents for cybersecurity! x.com/abramovichtalo…
We’re @icmlconf Hall B2-B3, Poster W-101 right now to talk about our agent for discovering new cybersec vulns. Come chat with us!
Excited to share our new Community Alignment project! We introduce 1) a new alignment dataset, 2) a way of measuring the social/cultural values of LM responses, and 3) models tuned to these values. A quick 🧵
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
The SWE-bench / SWE-agent team agrees with @polynoamial on this, and we'll have two small launches on Thursday to push this line of thought forward. Can you guess what they'll be? x.com/latentspacepod…
Noam Brown from OpenAI just dropped a truth bomb: "Your fancy AI scaffolds will be washed away by scale" Routers, harnesses, complex agentic systems... all getting replaced by models that just work better out of the box The reasoning models already proved this
These Verified numbers are similar to the ones Claude 3.7 achieved at the end of February 2025. Seems like the open-closed source gap is less than half a year right now.
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️