Xi Ye
@xiye_nlp
I study NLP. Postdoctoral fellow @PrincetonPLI. Incoming assistant professor @UAlberta (Summer 2025). CS PhD @UTAustin.
I am very excited to share that I am joining University of Alberta @UAlberta as an assistant professor in Summer 2025. Before that, I will spend a year at Princeton PLI @PrincetonPLI working on language models.
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
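For readers unfamiliar with the Pass@k metric quoted above: it estimates the chance that at least one of k sampled proofs is verified. A minimal sketch of the standard unbiased estimator (Chen et al., 2021); the numbers in the example call are illustrative, not from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: proof attempts sampled per problem
    c: attempts verified correct (e.g., accepted by the Lean checker)
    k: sampling budget being estimated
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 100 attempts, 41 verified -> estimated Pass@32
print(pass_at_k(n=100, c=41, k=32))
```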
Can data owners & LM developers collaborate to build a strong shared model while each retains data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
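The key mechanism behind the opt-in/opt-out claim is that each data owner contributes a separate expert that can be disabled at inference. A minimal sketch of one way this could look in a mixture-of-experts layer: mask the router logits of opted-out experts before the softmax. Names and shapes here are illustrative assumptions, not the FlexOlmo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptOutMoELayer(nn.Module):
    """Toy MoE layer whose experts can be switched off at inference.
    Illustrative sketch only, not the FlexOlmo architecture."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # active: bool [n_experts]; False = that owner's data is opted out
        logits = self.router(x)                              # [batch, E]
        logits = logits.masked_fill(~active, float("-inf"))  # silence opted-out experts
        weights = F.softmax(logits, dim=-1)                  # renormalize over active ones
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # [batch, d, E]
        return (outs * weights.unsqueeze(1)).sum(dim=-1)

layer = OptOutMoELayer(d_model=16, n_experts=4)
x = torch.randn(2, 16)
active = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = layer(x, active)
```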
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
“Our team of research software engineers has played a key role in the cutting-edge research in AI at Princeton that is being picked up by industry and also garnering awards and recognition at leading conferences.” Meet the AI Lab RSEs: bit.ly/3T6h4Fy
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
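A hedged sketch of the pruning idea as described in the tweet: global heads keep full causal attention (and the full KV cache), while pruned heads see only a local window, so their cache beyond the window can be dropped. The window size and mask construction below are illustrative assumptions, not the PruLong code.

```python
import torch

def per_head_masks(seq_len: int, is_local: torch.Tensor, window: int = 128) -> torch.Tensor:
    """Boolean attention masks per head: full causal for global heads,
    causal sliding-window for pruned (local) heads.

    is_local: bool [n_heads]; True = head restricted to the last `window` tokens.
    Returns [n_heads, seq_len, seq_len]; True = query may attend to key.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i
    local = causal & (i - j < window)
    return torch.where(is_local.view(-1, 1, 1), local, causal)

# 2 global heads + 6 local heads: only the global heads need KV entries
# older than the 128-token window, shrinking the "critical KV footprint".
is_local = torch.tensor([False, False] + [True] * 6)
masks = per_head_masks(seq_len=512, is_local=is_local)
```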
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason; they just aren’t perfect at long-form generation yet. We evaluate reasoning models on the LongProc benchmark (which requires generating 8K-token CoTs, see thread). Reasoning…
🤔Most LLMs now have ≥128K context windows, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
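For context, MEND-style editors train a hypernetwork that maps a layer's raw gradient on the new fact to a better parameter update; PropMEND keeps that recipe but supervises the hypernetwork so the edit also propagates (e.g., to multi-hop questions entailed by the fact). A simplified, non-faithful sketch of the gradient-editing step (MEND actually edits the low-rank factors of the gradient; the elementwise MLP here is a stand-in):

```python
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Toy hypernetwork that rewrites a weight gradient before it is applied."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.net(grad.unsqueeze(-1)).squeeze(-1)  # elementwise transform

def edit_step(layer: nn.Linear, fact_loss: torch.Tensor, editor: GradEditor, lr: float = 1e-4):
    """One knowledge edit: raw gradient on the new fact -> edited gradient -> update."""
    (raw_grad,) = torch.autograd.grad(fact_loss, layer.weight)
    with torch.no_grad():
        layer.weight -= lr * editor(raw_grad)

# usage sketch (lm_loss_on_new_fact and the layer choice are hypothetical):
# fact_loss = lm_loss_on_new_fact(model)
# edit_step(model.some_mlp_layer, fact_loss, editor)
```

The editor itself is meta-trained so that the post-edit model answers queries entailed by the injected fact, which is where the propagation objective comes in.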
Just updated my profile from "incoming assistant professor at U of A (starting July 2025)" to "... (as soon as I get my work permit)". Applied for the Canadian work permit in Nov 2024; still no news at all from IRCC, nor any estimated timeline. Definitely an…
What if we could compose reasoning skills together like LEGO🧩? Check out @fangcong_y10593's work! We find a way to train models on Skill A and Skill B separately, then enable compositional reasoning over A + B together.
Solving complex problems with CoT requires combining different skills. We can do this by: 🧩Modifying the CoT data format to be “composable” with other skills 🔥Training models on each skill 📌Combining those models This leads to better zero-shot reasoning on tasks involving skill composition!
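The thread doesn't spell out how the skill models are combined; one common recipe that fits the description is task-vector merging (add each skill model's weight delta to the shared base). A sketch under that assumption; the paper's actual combination method may differ.

```python
import torch

def merge_skills(base_state: dict, skill_states: list, alpha: float = 1.0) -> dict:
    """Task-arithmetic merge: base + alpha * sum of per-skill weight deltas.
    One plausible way to 'combine those models'; not necessarily the paper's method."""
    merged = {}
    for name, w in base_state.items():
        delta = sum(s[name] - w for s in skill_states)
        merged[name] = w + alpha * delta
    return merged

# usage: state = merge_skills(base.state_dict(),
#                             [skill_a.state_dict(), skill_b.state_dict()])
# model.load_state_dict(state)
```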
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to verbalize via text CoTs 📉Humans reach 93%, but Gemini-2.5-Pro gets only 63% and Qwen2.5-72B just 38%
Some personal news: I'll join @UMassAmherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at @Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!
In a new blog post, @HowardYen1 and @xiye_nlp introduce HELMET and LongProc, two benchmarks from a recent effort to build a holistic test suite for evaluating long-context LMs. Read now: pli.princeton.edu/blog/2025/long…
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse evaluation criteria 📋✍️. EvalAgent finds 👩🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇