Zhiyuan Zeng
@ZhiyuanZeng_
PhD-ing @uwnlp @uwcse | Prev. @Tsinghua_Uni @TsinghuaNLP @princeton_nlp
Is a single accuracy number all we can get from model evals?🤔 🚨Does NOT tell where the model fails 🚨Does NOT tell how to improve it Introducing EvalTree🌳 🔍identifying LM weaknesses in natural language 🚀weaknesses serve as actionable guidance (paper&demo 🔗in🧵) [1/n]
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
WHY do you prefer something over another? Reward models treat preference as a black box 😶🌫️, but human brains 🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper 🎨PrefPalette✨ Why it matters 👉🏻🧵
Great science starts with great questions. 🤔✨ Meet AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking. 🧵
Some updates 🚨 I finished my Ph.D. at @uwcse in June 2025! After a year at AI2 as a Research Scientist, I am joining CMU @LTIatCMU & @mldcmu (courtesy) as an Assistant Professor in Fall 2026. The journey, acknowledgments & recruiting in 🧵
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
🤔 How do we train AI models that surpass their teachers? 🚨 In #COLM2025: ✨Delta learning ✨makes LLM post-training cheap and easy – with only weak data, we beat open 8B SOTA 🤯 The secret? Learn from the *differences* in weak data pairs! 📜 arxiv.org/abs/2507.06187 🧵 below
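The "learn from the *differences* in weak data pairs" idea above can be sketched as a preference loss over pairs of weak outputs. This is a minimal illustration, not the paper's implementation: the DPO-style loss and the pairing of two weak models' outputs are assumptions based only on the tweet's one-line description.

```python
import math

def delta_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss on one (chosen, rejected) pair of sequence log-probs.

    In the delta-learning reading of the tweet, both outputs come from *weak*
    models: the slightly better weak output is 'chosen', the worse one is
    'rejected'. The training signal is the *difference* between them, not
    their absolute quality.
    """
    # Log-ratio margin of the policy over the reference model.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: small when the policy prefers the chosen output.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# With no margin the loss is log(2); widening the chosen margin lowers it.
baseline = delta_pair_loss(0.0, 0.0, 0.0, 0.0)
improved = delta_pair_loss(5.0, 0.0, 0.0, 0.0)
```

The point of the sketch is only that the loss depends on the gap between the two weak outputs, which is the "delta" the tweet highlights.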
#UWAllen @UW & @nvidia researchers earned a #MLSys2025 Best Paper Award for boosting #LLM performance with FlashInfer—and showed “what’s possible when academia, industry & the open-source community innovate together,” says @ye_combinator. #AI #UWdiscovers news.cs.washington.edu/2025/07/01/all…
EvalTree accepted to @COLM_conf 2025 - my first PhD work and first COLM paper 🙌! What would you like to see next—extensions, applications, or other directions? Always open to ideas! 🧐
Worried about overfitting to IFEval? 🤔 Use ✨IFBench✨ our new, challenging instruction-following benchmark! Loved working w/ @valentina__py! Personal highlight: our multi-turn eval setting makes it possible to isolate constraint-following from the rest of the instruction 🔍
💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR. But the set of constraints and verifier functions is limited and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.
Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417
🚀 Training an image generation model and picking sides between autoregressive (AR) and diffusion? Why not both? Check out MADFormer with half of the model layers for AR and half for diffusion. AR gives a fast guess for the next patch prediction while diffusion helps refine the…
Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and…
🤯What we know about RL for reasoning might not hold outside math and code? We revisit established findings on RL for LLM reasoning on six domains (Math, Code, Science, Logic, Simulation, Tabular) and found that previous conclusions drawn on math and code are surprisingly…
Wanna 🔎 inside Internet-scale LLM training data w/o spending 💰💰💰? Introducing infini-gram mini, an exact-match search engine with 14x less storage req than the OG infini-gram 😎 We make 45.6 TB of text searchable. Read on to find our Web Interface, API, and more. (1/n) ⬇️
Yayyy!!! Best paper honorable mention at CVPR goes to our Molmo and Pixmo @allen_ai! This is now becoming a trend :) Last year both OLMo and Dolma received best paper awards at ACL.
A bit late to announce, but I’m excited to share that I'll be starting as an assistant professor at the University of Maryland @umdcs this August. I'll be recruiting PhD students this upcoming cycle for fall 2026. (And if you're a UMD grad student, sign up for my fall seminar!)
🎉Our Spurious Rewards is available on ArXiv! We added experiments on - More prompts/steps/models/analysis... - Spurious Prompts! Surprisingly, we obtained 19.4% gains when replacing prompts with LaTeX placeholder text (\lipsum) 😶🌫️ Check out our 2nd blog: tinyurl.com/spurious-prompt
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: + 28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
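The three reward variants compared above can be sketched as a toy verifier. This is a hypothetical illustration (the function name and interface are mine); in the actual setup these rewards score model completions inside an RLVR training loop rather than raw strings.

```python
import random

def spurious_reward(completion, ground_truth, mode):
    """Toy version of the reward variants in the Spurious Rewards comparison.

    'random'    : coin flip -- ignores the answer entirely
    'incorrect' : rewards only *wrong* answers
    'truth'     : standard verifiable reward (correct answer gets 1.0)
    """
    correct = completion.strip() == ground_truth.strip()
    if mode == "random":
        return float(random.random() < 0.5)
    if mode == "incorrect":
        return float(not correct)
    return float(correct)
```

The surprising finding in the thread is that even the `"random"` and `"incorrect"` variants, which carry no information about correctness, still yield large MATH-500 gains on Qwen2.5-Math-7B.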
Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum could bring 19.4% gains over the default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄
We are #1 on the @huggingface heatmap - this is what true openness looks like!🥇🎉 750+ models 230+ datasets And counting... Come build with us huggingface.co/spaces/cfahlgr…