CLS
@ChengleiSi
PhDing @stanfordnlp | teaching language models to do research | real AGI is the friends we made along the way
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE…
IOI 2025 is next week. How many AI teams will get a gold medal?
Life Update: I will join @UTiSchool as an Assistant Professor in Fall 2026 and will continue my work on LLM, HCI, and Computational Social Science. I'm building a new lab on Human-Centered AI Systems and will be hiring PhD students in the coming cycle!
I'm sadly not at #IC2S2 😭, but I will be at #ACL2025 in Vienna ☕️ next week!! Please spread the word that I'm recruiting prospective PhD students: lucy3.notion.site/for-prospectiv…
Watching the model solve these IMO problems and achieve gold-level performance was magical. A few thoughts 🧵
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
TL;DR: When you add a system prompt asking the model to act "based", it might act based.
Update on where @grok has been & what happened on July 8th. First off, we deeply apologize for the horrific behavior that many experienced. Our intent for @grok is to provide helpful and truthful responses to users. After careful investigation, we discovered the root cause…
When are AI-designed drugs making it to patients? No AI-designed medicine is on pharmacy shelves yet, but the first wave of molecules is now in Phase 2/early Phase 3 (chiefly rentosertib for IPF). Here's where we are, a thread👇
Are you a researcher, trying to build a small GPU cluster? Did you already build one, and it sucks? I manage USC NLP’s GPU cluster and I’m happy to offer my expertise. I hope I can save you some headaches and make some friends. Please reach out!
Can data owners & LM developers collaborate to build a strong shared model while each retaining data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…
July 4th break in our #AI4Science seminar series. Join us next week for a talk by @ChengleiSi on the epic 2-year experiment evaluating (and executing!) AI-generated scientific ideas. lu.ma/9qq72ebt
LLMs can generate research ideas that look more novel than humans’, but are they actually better? Stanford ran a study where LLM- and human-authored ideas were put to the test. Human ideas were blindly rated consistently better, with LLM ideas seeing 37× larger score drops post-execution
ChengLei has the most creative research projects: he had PhD students execute AI research ideas for months
Amazing follow-up work! After finding that AI research ideas were judged (by human experts) better than human ideas... They tested it by actually executing the research projects! Turns out human ideas are better (judges were wrong!) – but only narrowly & not statistically…
“Finally, maybe this is controversial but ultimately progress in science is bottlenecked by real-world experiments.” If this is controversial in SF, we’re cooked.
We don’t have AI self-improvement yet, and when we do it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a “fast takeoff”, but rather extremely gradual across many years, probably a decade. The first thing to know is that…
Can AI ideas hold up in the lab? This study from Stanford says not as well as human ones, but there's hope. With enough training/reasoning, I'm pretty sure LLMs could nail 'small-scale discoveries', though not Nobel-level stuff. Great work @ChengleiSi @tatsu_hashimoto @Diyi_Yang
guys LLMs are trailing by barely half a peer-review point at doing research
LLM research ideas look shiny on paper but slip when someone actually builds the project. This 131-page work checks whether those projects still look strong once experts run every experiment. It shows a clear drop in quality for the LLM ideas, which means judging ideas only at…