Xi Ye
@xiye_nlp
I study NLP. Postdoctoral fellow @PrincetonPLI. Incoming assistant professor @UAlberta (Summer 2025). CS PhD @UTAustin.
I am very excited to share that I am joining University of Alberta @UAlberta as an assistant professor in Summer 2025. Before that, I will spend a year at Princeton PLI @PrincetonPLI working on language models.
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
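For readers unfamiliar with the Pass@k metric quoted above: it estimates the chance that at least one of k sampled proofs is verified. A minimal sketch of the standard unbiased estimator (Chen et al., 2021); the numbers in the example call are illustrative, not from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: proof attempts sampled per problem
    c: attempts verified correct (e.g., accepted by the Lean checker)
    k: sampling budget being estimated
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 100 attempts, 41 verified -> estimated Pass@32
print(pass_at_k(n=100, c=41, k=32))
```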
Can data owners & LM developers collaborate to build a strong shared model while each retains data control? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
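The key mechanism behind the opt-in/opt-out claim is that each data owner contributes a separate expert that can be disabled at inference. A minimal sketch of one way this could look in a mixture-of-experts layer: mask the router logits of opted-out experts before the softmax. Names and shapes here are illustrative assumptions, not the FlexOlmo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptOutMoELayer(nn.Module):
    """Toy MoE layer whose experts can be switched off at inference.
    Illustrative sketch only, not the FlexOlmo architecture."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # active: bool [n_experts]; False = that owner's data is opted out
        logits = self.router(x)                              # [batch, E]
        logits = logits.masked_fill(~active, float("-inf"))  # silence opted-out experts
        weights = F.softmax(logits, dim=-1)                  # renormalize over active ones
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # [batch, d, E]
        return (outs * weights.unsqueeze(1)).sum(dim=-1)

layer = OptOutMoELayer(d_model=16, n_experts=4)
x = torch.randn(2, 16)
active = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = layer(x, active)
```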
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
“Our team of research software engineers has played a key role in the cutting-edge research in AI at Princeton that is being picked up by industry and also garnering awards and recognition at leading conferences.” Meet the AI Lab RSEs: bit.ly/3T6h4Fy
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
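A hedged sketch of the pruning idea as described in the tweet: global heads keep full causal attention (and the full KV cache), while pruned heads see only a local window, so their cache beyond the window can be dropped. The window size and mask construction below are illustrative assumptions, not the PruLong code.

```python
import torch

def per_head_masks(seq_len: int, is_local: torch.Tensor, window: int = 128) -> torch.Tensor:
    """Boolean attention masks per head: full causal for global heads,
    causal sliding-window for pruned (local) heads.

    is_local: bool [n_heads]; True = head restricted to the last `window` tokens.
    Returns [n_heads, seq_len, seq_len]; True = query may attend to key.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i
    local = causal & (i - j < window)
    return torch.where(is_local.view(-1, 1, 1), local, causal)

# 2 global heads + 6 local heads: only the global heads need KV entries
# older than the 128-token window, shrinking the "critical KV footprint".
is_local = torch.tensor([False, False] + [True] * 6)
masks = per_head_masks(seq_len=512, is_local=is_local)
```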
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason; they just aren’t perfect at long-form generation yet. We evaluate reasoning models on the LongProc benchmark (which requires generating 8K-token CoTs, see thread). Reasoning…
🤔Most LLMs now have ≥128K context windows, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
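For context, MEND-style editors train a hypernetwork that maps a layer's raw gradient on the new fact to a better parameter update; PropMEND keeps that recipe but supervises the hypernetwork so the edit also propagates (e.g., to multi-hop questions entailed by the fact). A simplified, non-faithful sketch of the gradient-editing step (MEND actually edits the low-rank factors of the gradient; the elementwise MLP here is a stand-in):

```python
import torch
import torch.nn as nn

class GradEditor(nn.Module):
    """Toy hypernetwork that rewrites a weight gradient before it is applied."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.net(grad.unsqueeze(-1)).squeeze(-1)  # elementwise transform

def edit_step(layer: nn.Linear, fact_loss: torch.Tensor, editor: GradEditor, lr: float = 1e-4):
    """One knowledge edit: raw gradient on the new fact -> edited gradient -> update."""
    (raw_grad,) = torch.autograd.grad(fact_loss, layer.weight)
    with torch.no_grad():
        layer.weight -= lr * editor(raw_grad)

# usage sketch (lm_loss_on_new_fact and the layer choice are hypothetical):
# fact_loss = lm_loss_on_new_fact(model)
# edit_step(model.some_mlp_layer, fact_loss, editor)
```

The editor itself is meta-trained so that the post-edit model answers queries entailed by the injected fact, which is where the propagation objective comes in.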
Just updated my profile from "incoming assistant professor at U of A (starting July 2025)" to "... (as soon as I get my work permit)". Applied for the Canadian work permit in Nov 2024; still no news at all from IRCC, nor any estimated timeline. Definitely an…
What if we could compose reasoning skills together like LEGO🧩? Check out @fangcong_y10593's work! We find a way to train models on Skill A and Skill B separately, then enable compositional reasoning over A + B together.
Solving complex problems with CoT requires combining different skills. We can do this by: 🧩Modifying the CoT data format to be “composable” with other skills 🔥Training models on each skill 📌Combining those models This leads to better zero-shot reasoning on tasks involving skill composition!
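The thread doesn't spell out how the skill models are combined; one common recipe that fits the description is task-vector merging (add each skill model's weight delta to the shared base). A sketch under that assumption; the paper's actual combination method may differ.

```python
import torch

def merge_skills(base_state: dict, skill_states: list, alpha: float = 1.0) -> dict:
    """Task-arithmetic merge: base + alpha * sum of per-skill weight deltas.
    One plausible way to 'combine those models'; not necessarily the paper's method."""
    merged = {}
    for name, w in base_state.items():
        delta = sum(s[name] - w for s in skill_states)
        merged[name] = w + alpha * delta
    return merged

# usage: state = merge_skills(base.state_dict(),
#                             [skill_a.state_dict(), skill_b.state_dict()])
# model.load_state_dict(state)
```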
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to verbalize via text CoTs 📉Humans reach 93%, but Gemini-2.5-Pro gets only 63% and Qwen2.5-72B just 38%
Some personal news: I'll join @UMassAmherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at @Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!
In a new blog post, @HowardYen1 and @xiye_nlp introduce HELMET and LongProc, two benchmarks from a recent effort to build a holistic test suite for evaluating long-context LMs. Read now: pli.princeton.edu/blog/2025/long…
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse evaluation criteria 📋✍️. EvalAgent finds 👩🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇