Xingyu Fu ✈️ ICML25
@XingyuFu2
Postdoc Fellow @PrincetonPLI | PhD @Penn @cogcomp. | Focused on Vision+Language | Previous: @MSFTResearch @AmazonScience B.S. @UofIllinois | ⛳️😺
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
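The headline numbers above are pass@k scores. Below is a short sketch of the standard unbiased pass@k estimator (Chen et al., 2021) that results like Pass@32 are typically computed with; it illustrates the metric only and is not Goedel-Prover V2's own evaluation script.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given n total attempts of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a prover samples 32 proof attempts per problem; if 3 of them
# check in Lean, that problem counts as solved at pass@32:
print(pass_at_k(n=32, c=3, k=32))  # 1.0
print(pass_at_k(n=32, c=3, k=1))   # ~0.094 (expected single-attempt success rate)
```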
I will be at #ICML2025 next week and will present #ReFocus on Tuesday afternoon. 📍 West Exhibition Hall B2-B3 #W-202 ⏱️ Tue 15 Jul, 4:30–7:00 p.m. PDT Happy to chat and connect! Feel free to DM 😁 ReFocus link: huggingface.co/datasets/ReFoc…
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
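To make the opt-in/opt-out idea concrete, here is a toy mixture-of-experts layer in which each expert stands in for one data owner's module and a boolean mask drops opted-out experts from routing at inference time. A minimal sketch assuming a dense PyTorch-style MoE; the class and its interface are illustrative, not FlexOlmo's actual architecture or released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptOutMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); active: (n_experts,) bool opt-in mask
        logits = self.router(x)
        logits = logits.masked_fill(~active, float("-inf"))  # opted-out experts get zero weight
        weights = F.softmax(logits, dim=-1)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_model)
        return torch.einsum("be,bed->bd", weights, outs)

layer = OptOutMoE(d_model=16, n_experts=4)
x = torch.randn(2, 16)
all_in = torch.tensor([True, True, True, True])
owner3_out = torch.tensor([True, True, False, True])  # data owner 3 opts out at inference
print(layer(x, all_in).shape, layer(x, owner3_out).shape)
```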
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…
Research with amazing collaborators @JizeJiang, @MeitangLi, and @JingchengYang, guided by great advisors and supported by the generous help of talented researchers @BowenJin13, @XingyuFu2, and many open-source contributors (easyr1, verl, vllm... etc).
Excited to introduce VTool-R1! We’ve trained VLMs to “think visually” using RL, blending Python-based 🖼️visual edits with💡textual Chain-of-Thought reasoning. Our trained qwen2.5-VL-32B surpasses GPT-4o on ChartQA & TableVQA, and even the compact qwen2.5-VL-7B significantly…
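For a sense of what a Python-based visual edit can look like mid-reasoning, here is a small illustrative tool in the spirit of the tweet: crop and enlarge a chart region so the next reasoning step sees only the relevant cells. The function name and tool interface are assumptions for illustration, not VTool-R1's released API.

```python
from PIL import Image

def zoom_region(image_path: str, box: tuple[int, int, int, int],
                scale: int = 2) -> Image.Image:
    """Crop `box` = (left, top, right, bottom) and upscale the region."""
    img = Image.open(image_path)
    region = img.crop(box)
    return region.resize((region.width * scale, region.height * scale))

# Illustrative agent loop:
#   1. Textual CoT: "The answer is in the 2019 column; zoom into it."
#   2. Tool call:   edited = zoom_region("table.png", (120, 40, 260, 300))
#   3. The edited image is appended to the conversation and reasoning continues.
```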
😌Been wanting to post since March but waited for the graduation photo….Thrilled to finally share that I’ll be joining Princeton University as a postdoc @PrincetonPLI this August! Endless thanks to my incredible advisors and mentors from Penn, UW, Cornell, NYU, UCSB, USC,…

ReFocus🔍 Visual reasoning for tables and charts with edits. Happy to share that ReFocus was accepted at #ICML2025. We’ve open-sourced code and training data: zeyofu.github.io/ReFocus/ ReFocus enables multimodal LMs to better reason over tables and charts with visual edits. It also provides…
Teach GPT-4o to edit on charts and tables to ReFocus 🔍 and facilitate reasoning 🧠! 🔥 We introduce ReFocus, which edits input table and chart images to better reason visually zeyofu.github.io/ReFocus/ 🤔 Can we teach smaller models to learn such visual CoT reasoning? 🚀 Yes --…
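As a rough illustration of a ReFocus-style edit (the released code may use different operations and names), the sketch below masks distracting table columns and draws a box around the one to focus on before the model re-reads the image.

```python
from PIL import Image, ImageDraw

def highlight_and_mask(img: Image.Image,
                       keep_box: tuple[int, int, int, int],
                       mask_boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for box in mask_boxes:                             # grey out irrelevant columns
        draw.rectangle(box, fill=(220, 220, 220))
    draw.rectangle(keep_box, outline="red", width=4)   # box the column to focus on
    return out

# Example usage (coordinates are made up for illustration):
# edited = highlight_and_mask(Image.open("chart.png"),
#                             keep_box=(300, 50, 420, 600),
#                             mask_boxes=[(60, 50, 180, 600)])
```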
🚀 Introducing Science-T2I - Towards bridging the gap between AI imagination and scientific reality in image generation! [CVPR 2025] 📜 Paper: arxiv.org/abs/2504.13129 🌐 Project: jialuo-li.github.io/Science-T2I-Web 💻 Code: github.com/Jialuo-Li/Scie… 🤗 Dataset: huggingface.co/collections/Ji… 🔍…
This paper is interestingly thought-provoking for me. There is a chance that it's easier to "align a t2i model with real physics" in post-training, and to let it learn to generate whatever (physically implausible) combinations it wants in pretraining. As opposed to trying hard to come up with…
🎉 Excited to share that our paper, "MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding", will be presented at #ICLR2025! 📅 Date: April 24 🕒 Time: 3:00 PM 📍 Location: Hall 3 + Hall 2B #11 MuirBench challenges multimodal LLMs with diverse multi-image…
Embedding a scientific basis in pre-trained T2I models can enhance the realism and consistency of the results. Cool work in "Science-T2I: Addressing Scientific Illusions in Image Synthesis" jialuo-li.github.io/Science-T2I-We…
Our previous work showed that 𝐜𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐯𝐢𝐬𝐮𝐚𝐥 𝐜𝐡𝐚𝐢𝐧‑𝐨𝐟‑𝐭𝐡𝐨𝐮𝐠𝐡𝐭𝐬 𝐯𝐢𝐚 𝐭𝐨𝐨𝐥 𝐮𝐬𝐞 significantly boosts GPT‑4o’s visual reasoning performance. Excited to see this idea incorporated into OpenAI’s o3 and o4‑mini models (openai.com/index/thinking…).…
Visual Chain-of-Thought with ✏️Sketchpad Happy to share that ✏️Visual Sketchpad was accepted to #NeurIPS2024. Sketchpad thinks🤔 by creating visual reasoning chains for multimodal LMs, enhancing GPT-4o's reasoning on math and vision tasks. We’ve open-sourced code: visualsketchpad.github.io
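A minimal example of a Sketchpad-style "visual thought", assuming the model has access to a plotting tool: to count intersections of two curves, it writes plotting code, renders the figure, and reasons over the image rather than over symbols alone. This is an illustration of the idea, not the project's actual tool code.

```python
import numpy as np
import matplotlib.pyplot as plt

# "How many times do y = sin(x) and y = x/3 intersect?"
x = np.linspace(-10, 10, 1000)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, x / 3, label="x/3")
plt.legend()
plt.savefig("sketch.png")  # the rendered figure is appended to the multimodal context
```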
#ICLR2025 Oral: LLMs often struggle with reliable and consistent decisions under uncertainty 😵‍💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD…
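For contrast with what BIRD targets, here is the naive baseline it improves on: estimating per-choice probabilities by repeated sampling and vote counting. The `query_llm` callable is a hypothetical stand-in for any chat API; BIRD itself replaces this kind of raw sampling with a more structured Bayesian treatment, which is not reproduced here.

```python
from collections import Counter

def choice_probabilities(query_llm, prompt: str, options: list[str],
                         n_samples: int = 20) -> dict[str, float]:
    """Empirical choice probabilities from repeated sampling (naive baseline)."""
    votes = Counter(query_llm(prompt) for _ in range(n_samples))
    return {opt: votes.get(opt, 0) / n_samples for opt in options}

# probs = choice_probabilities(my_model, "Should the robot ...? Answer A or B.", ["A", "B"])
# Naive sampling like this is often poorly calibrated, which is the gap BIRD addresses.
```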
Excited to see the image reasoning in o3 and o4-mini!!🤩 We introduced this idea a year ago in visual Sketchpad (visualsketchpad.github.io). Excited to see @OpenAI baking this into their model through agentic RL. Great work! And yes, reasoning should be multimodal! Huge shoutout…
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.
Check out our new paper on long-context understanding! We use AgenticLU to significantly improve the base model’s long-context performance (+14.7% avg. across several datasets) without any increase in real inference time!
LLMs struggle with long-context reasoning—retrieving key info & clarifying complex queries. We introduce Agentic Long-context Understanding (AgenticLU), an agentic framework that: ✅ Uses Chain-of-Clarifications (CoC) to iteratively refine queries & retrieve relevant evidence.…
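A hedged sketch of the Chain-of-Clarifications loop as described above; `llm` and `retrieve` are hypothetical callables standing in for the model and the evidence-retrieval step, not AgenticLU's released interface.

```python
def chain_of_clarifications(llm, retrieve, question: str, context: str,
                            max_steps: int = 3) -> str:
    """Iteratively ask clarifying sub-questions, pull supporting evidence
    from the long context, then answer with the accumulated notes."""
    scratchpad = []
    for _ in range(max_steps):
        sub_q = llm(f"Question: {question}\nNotes: {scratchpad}\n"
                    "What do you still need to clarify? Reply DONE if nothing.")
        if sub_q.strip() == "DONE":
            break
        evidence = retrieve(context, sub_q)       # point back into the long document
        scratchpad.append((sub_q, evidence))
    return llm(f"Question: {question}\nClarifications: {scratchpad}\nFinal answer:")
```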
MuirBench has been accepted to #ICLR2025! 🚀 Companies like Apple, TikTok, and Salesforce are already evaluating their LMMs on its multi-image setup—a robust testbed for multimodal reasoning. GenAI needs more benchmarks like this.🤯 Kudos to @fwang_nlp, @XingyuFu2, and team! 👏
Can GPT-4o and Gemini-Pro handle 𝐦𝐮𝐥𝐭𝐢𝐩𝐥𝐞 𝐢𝐦𝐚𝐠𝐞𝐬? Introducing MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding. 🌐 Explore here: muirbench.github.io 📄 Paper: arxiv.org/abs/2406.09411 📊 Data: huggingface.co/datasets/MUIRB…
𝗠𝘂𝗶𝗿𝗕𝗲𝗻𝗰𝗵 is officially accepted at #ICLR2025! 🎉 Recent VLMs/MLLMs such as LLaVA-OneVision, MM1.5, and MAmmoTH-VL have demonstrated significant progress on MuirBench.🚀 Excited to see how MuirBench continues to drive the innovation of VLMs! #AI #MachineLearning #VLM…