Wenting Zhao
@wzhao_nlp
working on reasoning & llms, hanging out at @AIatMeta; incoming assistant prof @UMassAmherst; phd from @cornell_tech
Some personal news: I'll join @UMassAmherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at @Meta nyc. Reasoning will continue to be my main interest, with a focus on data-centric approaches🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!
🚨 New release: MegaScience
The largest & highest-quality post-training dataset for scientific reasoning is now open-sourced (1.25M QA pairs)!
📈 Trained models outperform official Instruct baselines
🔬 Covers 7+ disciplines with university-level textbook-grade QA
📄 Paper:…
I'll be around the ICML venue this afternoon. Message me if you want to meet! These days, I think about reasoning and RL. Also happy to talk about academia vs. industry (I think the lack of compute in academia is a feature not a bug), faculty and PhD student recruiting at UMass.
haven't written a new blog post in over a year, so here's a new one: justintchiu.com/blog/sftrl/ it's short
AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate solutions and codebases? Read our new paper looking at MLE-Bench: arxiv.org/pdf/2507.02554 #LLM #Agents #MLEBench
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜arxiv.org/abs/2507.00885 🧵1/5👇
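To make concrete what "predicting downstream performance" means here: a minimal sketch (not from the paper; NumPy/SciPy, with made-up numbers) of the standard recipe of fitting a power law on small models and extrapolating to a larger one. The thread's point is that for many pretraining setups this extrapolation fails for downstream tasks.

```python
# Minimal scaling-law fit: loss(N) = a * N^(-b) + c, fit on small models,
# then extrapolated to a larger N. All numbers below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # model sizes used for the fit
loss = power_law(n_params, 200.0, 0.3, 1.8)      # synthetic "observed" losses

(a, b, c), _ = curve_fit(power_law, n_params, loss, p0=(100.0, 0.5, 1.0))
print(power_law(1e10, a, b, c))                  # predicted loss at 10B params
```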
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
We don’t have AI that self-improves yet, and when we do, it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a “fast takeoff”, but rather extremely gradual across many years, probably a decade. The first thing to know is that…
Congrats to the team! They built my dream benchmark.
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total…
✨Release: We upgraded SkyRL into a highly-modular, performant RL framework for training LLMs. We prioritized modularity: easily prototype new algorithms, environments, and training logic with minimal overhead. 🧵👇
Blog: novasky-ai.notion.site/skyrl-v01
Code: github.com/NovaSky-AI/Sky…
Dang, truly impressed by how an academic lab just figured out a lot of mysteries in mid-training to close the RL gap between llama and qwen:
* a length scheduler plays a key role in stabilizing RL
* there is some dark magic in the prompt template?
* the data interaction stuff is really…
What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?…
LM training bottlenecks
2024: code RL -> code execution is slower than model inference
2025: reasoning model RL -> rolling out 32k tokens takes forever
maybe diffusion models are indeed the solution lol
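Rough arithmetic (all numbers hypothetical) on why rollout length dominates wall-clock time: autoregressive decoding is sequential, one token per forward pass, so a 32k-token rollout at a few dozen tokens per second per sequence runs for minutes.

```python
# Back-of-envelope sketch with made-up numbers: rollout length translates
# directly into wall-clock time because decoding is sequential.
tokens_per_rollout = 32_000
decode_tokens_per_sec = 50   # hypothetical per-sequence decoding speed

minutes_per_rollout = tokens_per_rollout / decode_tokens_per_sec / 60
print(f"~{minutes_per_rollout:.1f} min per rollout")  # ~10.7 min
```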
It's time to think about code generation beyond functional correctness. Refactoring multiple libraries requires designing APIs that support past and future use cases, which is challenging for even human engineers. Can't wait for LLMs to unify pytorch, tensorflow, and jax 😬
Are code agents good at software design, i.e. building general and reusable code? We present Librarian, a new refactoring method, and MiniCode, a verifiable refactoring benchmark that requires agents to design libraries that jointly minimize code from multiple repos 🧵
The more I dive into LM training, the more I feel pretraining research is just getting started. Some questions I’m particularly interested in:
* what data unlocks what capabilities?
* do we train on capabilities sequentially or in parallel?
* how many synthetic examples is a human example worth?
Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus was on quality, putting aside any quantity considerations. Guessing something like textbook content, in markdown? Or possibly samples from a really giant model?…
That’s the vision of commit0: github.com/commit-0/commi… there has been nearly zero improvement on this benchmark in the past few months. I don’t think this problem is solvable in 24 months…
cursor is a $100M business that will be worth $0 in 24 months
not because they built wrong - they built perfectly
but they built a sail for a race that's about to end
when AI just writes entire codebases, even the best IDE becomes irrelevant
There are still posts about 'new papers showing AI models cannot reason'. There are unfortunately problems with how these evaluations were done, and many of those limitations are already known, peer-reviewed, and published. Here is a simplified version of what's going on as far as I…
Where does one language model outperform the other? We examine this from first principles, performing unsupervised discovery of "abilities" that one model has and the other does not. Results show interesting differences between model classes, sizes and pre-/post-training.
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 🧵1/9
I wrote a note on linear transformations and symbols that traces a common conversation/interview I've had with students. Outer products, matrix rank, eigenvectors, linear RNNs -- the topics are really neat, and lead to great discussions of intuitions. cs.columbia.edu/~johnhew//fun-…
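If you want to poke at the first of those facts yourself: a tiny NumPy check (my own sketch, not from the note) that an outer product u v^T has rank 1, and that u is its eigenvector with eigenvalue v·u.

```python
# A = u v^T is rank 1: every column is a multiple of u.
# It also satisfies A u = (v . u) u, so u is an eigenvector of A.
import numpy as np

u = np.array([2.0, 1.0, -1.0])   # arbitrary example vectors
v = np.array([1.0, 0.0, 3.0])

A = np.outer(u, v)
print(np.linalg.matrix_rank(A))          # -> 1
print(np.allclose(A @ u, (v @ u) * u))   # -> True
```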
Most promising-looking AI research ideas don’t pan out, but testing them burns through compute and labor. Can LMs predict idea success without running any experiments? We show that they do it better than human experts!