Yizhong Wang
@yizhongwyz
Incoming assistant professor @UTCompSci, RS @BytedanceTalk, PhD from @uwcse, formerly @allen_ai @AIatMeta @MSFTResearch
Thrilled to announce that I will be joining @UTAustin @UTCompSci as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘


Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
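A rough sketch of the opt-in/opt-out idea described above, assuming a standard mixture-of-experts layer where each expert is contributed by one data owner (the class and parameter names are illustrative, not FlexOlmo's actual code): experts whose owners opt out are simply masked out of the router at inference time.

```python
import torch
import torch.nn as nn

class OptInMoE(nn.Module):
    """Toy mixture-of-experts layer where each expert belongs to one data owner.
    Experts whose owners opt out are masked out of the router at inference time."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor, opted_in: torch.Tensor) -> torch.Tensor:
        # opted_in: boolean vector of shape (n_experts,) controlled by data owners.
        logits = self.router(x)                              # (batch, n_experts)
        logits = logits.masked_fill(~opted_in, float("-inf"))
        weights = torch.softmax(logits, dim=-1)              # opted-out experts get weight 0
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, d, n)
        return torch.einsum("bdn,bn->bd", expert_outs, weights)

# Example: the owner of expert 2 opts out at inference time.
layer = OptInMoE(d_model=16, n_experts=4)
mask = torch.tensor([True, True, False, True])
out = layer(torch.randn(8, 16), mask)
```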
EvalTree accepted to @COLM_conf 2025 - my first PhD work and first COLM paper 🙌! What would you like to see next—extensions, applications, or other directions? Always open to ideas! 🧐
Is a single accuracy number all we can get from model evals?🤔 🚨Does NOT tell where the model fails 🚨Does NOT tell how to improve it Introducing EvalTree🌳 🔍identifying LM weaknesses in natural language 🚀weaknesses serve as actionable guidance (paper&demo 🔗in🧵) [1/n]
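As a toy reconstruction of the weakness-profiling step (the real EvalTree pipeline builds the capability tree with an LM; the node structure and thresholds below are made up for illustration): given a tree whose nodes carry natural-language capability descriptions and per-instance correctness for the evaluated model, weaknesses are the nodes whose accuracy falls well below the model's overall accuracy.

```python
from dataclasses import dataclass, field

@dataclass
class CapabilityNode:
    """One node in a capability tree: a natural-language skill description plus
    the benchmark instances (correctness flags) falling under that skill."""
    description: str
    results: list                      # list[bool], per-instance correctness
    children: list = field(default_factory=list)

def find_weaknesses(node, overall_acc, gap=0.15, min_size=20, out=None):
    """Hypothetical weakness-extraction pass: flag nodes whose accuracy is
    clearly below the model's overall accuracy, then recurse into children."""
    if out is None:
        out = []
    if len(node.results) >= min_size:
        acc = sum(node.results) / len(node.results)
        if acc < overall_acc - gap:
            out.append((node.description, acc))
    for child in node.children:
        find_weaknesses(child, overall_acc, gap, min_size, out)
    return out

tree = CapabilityNode(
    "math reasoning", [True] * 50 + [False] * 50,
    children=[CapabilityNode("multi-step word problems", [False] * 20 + [True] * 5)],
)
print(find_weaknesses(tree, overall_acc=0.7))
```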
💡Beyond math/code, instruction following with verifiable constraints is well suited to RLVR. But existing sets of constraints and verifier functions are limited, and most models overfit to IFEval. We introduce IFBench to measure model generalization to unseen constraints.
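A minimal example of what a verifiable constraint and its verifier function can look like (this particular constraint is invented for illustration, not taken from IFBench); the returned score is the kind of binary reward an RLVR loop would consume.

```python
import re

def verify_constraints(response: str) -> float:
    """Toy verifier for an instruction-following constraint of the kind RLVR can
    use as a reward (this specific constraint is made up for illustration):
    'answer with exactly three bullet points and do not use the word "very"'."""
    bullets = [line for line in response.splitlines() if line.strip().startswith("-")]
    has_three_bullets = len(bullets) == 3
    avoids_word = re.search(r"\bvery\b", response, flags=re.IGNORECASE) is None
    return float(has_three_bullets and avoids_word)  # 1.0 = constraint satisfied

print(verify_constraints("- one\n- two\n- three"))        # 1.0
print(verify_constraints("- one\n- a very short list"))   # 0.0
```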
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
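A heavily simplified sketch of the gradient-editing idea, assuming a MEND-style setup (the shapes, the toy fact loss, and the hypernetwork architecture here are illustrative only): a hypernetwork rewrites the raw gradient from a fact-insertion loss before it is applied as the editing update.

```python
import torch
import torch.nn as nn

d = 32
layer = nn.Linear(d, d)                      # the layer being edited
hypernet = nn.Sequential(                    # trained with a propagation objective
    nn.Linear(d * d, 256), nn.ReLU(), nn.Linear(256, d * d)
)

def edit_layer(inputs, targets, lr=1e-2):
    """One editing step: compute the raw gradient from the new fact,
    let the hypernetwork rewrite it, then apply it to the layer weights."""
    loss = nn.functional.mse_loss(layer(inputs), targets)   # stand-in for the fact loss
    (raw_grad,) = torch.autograd.grad(loss, layer.weight)
    edited_grad = hypernet(raw_grad.flatten()).view_as(raw_grad)
    with torch.no_grad():
        layer.weight -= lr * edited_grad

edit_layer(torch.randn(4, d), torch.randn(4, d))
```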
LMs often output answers that sound right but aren’t supported by input context. This is intrinsic hallucination: the generation of plausible, but unsupported content. We propose Precise Information Control (PIC): a task requiring LMs to ground only on given verifiable claims.
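One way to picture the PIC setup, as a toy precision check rather than the paper's actual metric: every sentence in the output should be supported by at least one of the provided verifiable claims, with a placeholder `supports` judge standing in for a real entailment model.

```python
def pic_precision(output: str, claims: list[str], supports) -> float:
    """Toy scorer in the spirit of the PIC setup (not the paper's metric):
    the fraction of output sentences supported by at least one provided claim.
    `supports(claim, sentence)` is a placeholder for an entailment/NLI judge."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(any(supports(c, s) for c in claims) for s in sentences)
    return supported / len(sentences)

# Example with a trivial lexical "judge" standing in for a real NLI model.
claims = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
output = "The Eiffel Tower is in Paris. It was completed in 1889."
print(pic_precision(output, claims, lambda c, s: s in c))  # 1.0
```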
We enabled OLMoTrace for Tülu 3 models! 🤠 Matched spans are shorter than for OLMo models, bc we can only search in Tülu's post-training data (base model is Llama). Yet we thought it'd still bring some value. Try yourself on the Ai2 playground -- playground.allenai.org
CONGRATS 🎉!
Congratulations to @UW #UWAllen Ph.D. grads @sharma_ashish_2 & @sewon__min, @TheOfficialACM Doctoral Dissertation Award honorees! Sharma won for #AI tools for mental health; Min received honorable mention for efficient, flexible language models. #ThisIsUW news.cs.washington.edu/2025/06/04/all…
I’m thrilled to share RewardBench 2 📊— we created a new multi-domain reward model evaluation that is substantially harder than RewardBench, trained and released 70 reward models, and gained insights about reward modeling benchmarks and downstream performance!
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
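For intuition, the three reward variants from the thread can be sketched as a single verifier-style function (a toy stand-in, not the paper's training code); the scalar it returns is what a GRPO/PPO-style RLVR update would consume.

```python
import random

def reward(sample_answer: str, gold_answer: str, mode: str) -> float:
    """Toy sketch of the reward variants above (the exact setup is in the paper):
    'random' ignores correctness entirely, 'incorrect' rewards only wrong answers,
    and 'ground_truth' is the standard RLVR verifier."""
    correct = sample_answer.strip() == gold_answer.strip()
    if mode == "random":
        return float(random.random() < 0.5)   # coin flip, independent of the answer
    if mode == "incorrect":
        return float(not correct)             # deliberately rewards wrong answers
    return float(correct)                     # ground-truth verifiable reward

# The scalar reward on each sampled solution feeds the policy-gradient update.
print(reward("42", "42", mode="random"))
```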
What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:
Glad to share that our AgoraBench paper has been accepted at @aclmeeting 2025 (main)! Special thanks to our coauthors @scott_sjy @xiangyue96 @vijaytarian @sylee_ai @yizhongwyz @kgashteo Carolin @wellecks @gneubig! A belief I hold more firmly now than when I started this project…
#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
Introducing 𝗤𝗔𝗹𝗶𝗴𝗻🚀, a 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗺𝗲𝘁𝗵𝗼𝗱 that improves language model performance using Markov chain Monte Carlo. With no model retraining, 𝗤𝗔𝗹𝗶𝗴𝗻 outperforms DPO-tuned models even when allowed to match inference compute, and achieves…
Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting them to their training data. We do this at unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
For years it’s been an open question — how much is a language model learning and synthesizing information, and how much is it just memorizing and reciting? Introducing OLMoTrace, a new feature in the Ai2 Playground that begins to shed some light. 🔦
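The core matching idea can be sketched on a toy corpus (OLMoTrace itself relies on pre-built indexes to search trillions of tokens in real time; the brute-force substring scan below is only for illustration): for each position in the model output, find the longest span that also appears verbatim in the training data.

```python
def longest_matching_spans(output_tokens, corpus_text, min_len=4):
    """Toy version of the span-matching idea: for each position in the model
    output, find the longest token span that appears verbatim in the corpus.
    (The real system uses pre-built indexes over trillions of tokens; scanning
    a string like this is only for illustration.)"""
    spans = []
    i = 0
    while i < len(output_tokens):
        best = 0
        for j in range(i + min_len, len(output_tokens) + 1):
            if " ".join(output_tokens[i:j]) in corpus_text:
                best = j
            else:
                break
        if best:
            spans.append((i, best, " ".join(output_tokens[i:best])))
            i = best
        else:
            i += 1
    return spans

corpus = "the quick brown fox jumps over the lazy dog"
output = "a quick brown fox jumps over a fence".split()
print(longest_matching_spans(output, corpus))
```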
🚀After a year of development based on our OSWorld, Computer Use Agent Arena is LIVE! Test top AI agents (Operator, Claude 3.7...) on all kinds of computer-use tasks with zero setup. Cloud-hosted, safe, and FREE! Try it now: arena.xlang.ai! Data & code coming soon!
🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 VL and more 🕹️ Test agents on 100+ real apps & websites with one-click config 🔒 Safe & free…
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
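A toy illustration of the superword idea (real SuperBPE first learns ordinary subword BPE and then lifts the whitespace restriction; this sketch starts from whole words to stay short): adjacent tokens are allowed to merge across word boundaries, so frequent multi-word strings become single tokens.

```python
from collections import Counter

def superword_merges(corpus: str, num_merges: int = 3):
    """Toy sketch of 'superword' merges: unlike standard BPE, whose pretokenization
    never lets merges cross whitespace, adjacent tokens here may merge across word
    boundaries, producing multi-word tokens."""
    tokens = corpus.split()              # start from word-level tokens for brevity
    merged = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        new_token = a + " " + b          # a single token that spans a space
        merged.append(new_token)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(new_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merged

print(superword_merges("in the end of the day the end of the story"))
```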
As the evaluation of LMs becomes increasingly complex and broad, how can we draw insights beyond simple metrics? Check out EvalTree, our latest work led by @ZhiyuanZeng_!
Announcing OLMo 2 32B: the first fully open model to beat GPT-3.5 & GPT-4o mini on a suite of popular, multi-skill benchmarks. Comparable to the best open-weight models, but with a fraction of the training compute. When you have a good recipe, ✨ magical things happen when you scale it up!
Excited to drive innovation and push the boundaries of open, scientific AI research & development! 🚀 Join us at @allen_ai to shape the future of OLMo, Molmo, Tulu, and more. We’re hiring at all levels—apply now! 👇 #AI #Hiring Research Engineer job-boards.greenhouse.io/thealleninstit… Research…