John Yang
@jyangballin
🌲 CS PhD @Stanford 🤖 SWE-bench + agent + smith 🆕 🎓 Prev. @princeton_nlp 🐯; @Berkeley_EECS 🐻
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

If you're like me and tired of grappling with huge agent scaffolds, check this out. @KLieret put a ton of thought into this. The simplicity is elegant and powerful. If you wanna get started + have questions, join the SWE-bench Slack! (link at bottom left of swebench.com)
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench Verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
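For a sense of what "0 special tools, compatible with any LM" can mean in practice, here is a rough illustrative sketch of a bare-bones agent loop. This is not mini's actual source; the model name, prompts, task text, and "DONE" stop convention are placeholder assumptions:

```python
# Illustrative sketch only: a bare-bones agent loop whose only "tool" is the shell.
# Model name, prompts, task, and the DONE convention are placeholders, not mini's code.
import subprocess
from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": "You are a software engineer. Reply with exactly one bash "
                                   "command per turn. Reply DONE when the task is finished."},
    {"role": "user", "content": "Task: make the failing test in tests/test_parser.py pass."},  # hypothetical task
]

for _ in range(30):  # hard cap on steps
    reply = client.chat.completions.create(model="gpt-4o", messages=messages).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.strip() == "DONE":
        break
    # Run the proposed command and feed its output back as the next observation.
    result = subprocess.run(reply, shell=True, capture_output=True, text=True, timeout=60)
    messages.append({"role": "user", "content": (result.stdout + result.stderr)[-5000:]})
```

Because the only interface is chat messages plus a shell, a loop like this works with any chat-style LM endpoint.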
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating them can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions. One result tells the story: a transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵
SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️
Join me next week at #ICML25, where I will be presenting my first first-author paper: EnIGMA, an LM agent for cybersecurity that uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption, and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here! 🧵⬇️
Our study led by @ChengleiSi reveals an “ideation–execution gap” 😲 Ideas from LLMs may sound novel, but when experts spend 100+ hrs executing them, they flop: 💥 👉 human‑generated ideas outperform on novelty, excitement, effectiveness & overall quality!
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
Marin 32B training crossed 1.5 trillion tokens today...
If you wanna stay up to date with the SWE-bench leaderboard, follow our new Twitter account! And if you're bored of SWE-bench Verified, check out SWE-bench Multimodal; +25% progress over the last 9 months.
We just updated the SWE-bench Multimodal leaderboard with new systems from @refact_ai, @allhands_ai and @TU_Muenchen. Congrats to all teams on pushing state-of-the-art performance! SWE-bench Multimodal challenges AI systems to fix issues that are described using screenshots.
AI companions aren’t science fiction anymore 🤖💬❤️ Thousands are turning to AI chatbots for emotional connection – finding comfort, sharing secrets, and even falling in love. But as AI companionship grows, the line between real and artificial relationships blurs. 📰 “Can A.I.…
The top SWE agent is not Cursor or Windsurf; it's two tools you can download from GitHub: OpenHands (@allhands_ai) and SWE-Agent. Btw SWE-Agent does have an X handle, but it looks fake or hacked. Check the link below to the LiveSWEBench benchmark and the links to the real agents.
FWIW there is a benchmark comparing Claude Code (X's fav), Cursor (X's old fav), Windsurf, GitHub Copilot's new agent mode, and Aider with the two best agentic coding tools no one ever mentions here (bc they're not being marketed by the big labs): SWE-Agent and OpenHands... 🧵
🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody asks them what they want. While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce. 🧵
As we optimize model reasoning over verifiable objectives, how does this affect humans' ability to understand that reasoning and achieve better collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
To find "good" GitHub repositories (good = well structured, lots of activity) for some language, I just use GitHub search (e.g. `language:go`), click "Repositories", then sort the search results by "Most stars". Feels kind of primitive; are there better ways to do this?
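One step up from the web UI is the GitHub search REST API, which accepts the same query qualifiers. A minimal sketch, assuming the `requests` package is available; the stars/pushed thresholds are arbitrary example values for filtering on activity rather than star count alone:

```python
# Query the GitHub repository search API with the same qualifiers the web UI uses.
import requests

def top_repos(language: str, min_stars: int = 1000, pushed_after: str = "2024-01-01", n: int = 10):
    # e.g. "language:go stars:>1000 pushed:>2024-01-01"
    query = f"language:{language} stars:>{min_stars} pushed:>{pushed_after}"
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "sort": "stars", "order": "desc", "per_page": n},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    return [(r["full_name"], r["stargazers_count"]) for r in resp.json()["items"]]

if __name__ == "__main__":
    for name, stars in top_repos("go"):
        print(f"{stars:>7}  {name}")
```

The same qualifiers (`stars:>N`, `pushed:>DATE`) also work directly in the web search box, which at least narrows things to repos with recent activity.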

This year there has been growing evidence that AI agents can conduct scientific research and produce papers end-to-end, to the point that some generated papers have already been accepted at top-tier conferences/workshops. Intology’s…
🎀 fine-grained, interpretable representation steering for LMs! Meet RePS: Reference-free Preference Steering! 1⃣ outperforms existing methods on 2B-27B LMs, nearly matching prompting 2⃣ supports both steering and suppression (beat system prompts!) 3⃣ jailbreak-proof (1/n)