Phillip (Yuseung) Lee
@yuseungleee
PhD student @kaist_ai
❗️Vision-Language Models (VLMs) struggle with even basic perspective changes! ✏️ In our new preprint, we aim to extend the spatial reasoning capabilities of VLMs to ⭐️arbitrary⭐️ perspectives. 📄Paper: arxiv.org/abs/2504.17207 🔗Project: apc-vlm.github.io 🧵[1/N]
Hidden in plain sight: VLMs overlook their visual representations "Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance." "VLMs are…
VLMs often struggle with physical reasoning tasks such as spatial reasoning. Excited to share how we can use world models + test-time search to zero-shot improve spatial reasoning in VLMs!
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
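As context for the idea above, here is a minimal sketch of test-time search with a world model: imagine several candidate camera moves, let the VLM score the imagined views, and answer from the most informative one. The API names (`world_model.imagine`, `vlm.score`, `vlm.answer`) are hypothetical placeholders, not the MindJourney interface.

```python
# Hedged sketch of best-of-N test-time search with a world model (hypothetical API).
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    action: str            # e.g., "turn_left_30", "move_forward_1m"
    imagined_view: object  # image generated by the world model
    score: float           # VLM's confidence that this view answers the question

def search_with_world_model(image, question, world_model, vlm, actions: List[str]) -> str:
    """Zero-shot test-time search: imagine views, score them, answer from the best one."""
    candidates = []
    for action in actions:
        view = world_model.imagine(image, action)   # hypothetical call
        score = vlm.score(view, question)           # hypothetical call
        candidates.append(Candidate(action, view, score))
    best = max(candidates, key=lambda c: c.score)
    return vlm.answer(best.imagined_view, question) # hypothetical call
```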
What exactly is a "world model"? And what limits existing video generation models from being true world models? In my new blog post, I argue that a true video world model must be causal, interactive, persistent, real-time, and physically accurate. xunhuang.me/blogs/world_mo…
Can VLMs build Spatial Mental Models like humans? Reasoning from limited views? Reasoning from partial observations? Reasoning about unseen objects behind furniture / beyond current view? Check out MindCube! 🌐mll-lab-nu.github.io/mind-cube/ 📰arxiv.org/pdf/2506.21458…
#ICCV2025 Introducing X-Fusion: Introducing New Modality to Frozen Large Language Models It is a novel framework that adapts pretrained LLMs (e.g., LLaMA) to new modalities (e.g., vision) while retaining their language capabilities and world knowledge! (1/n) Project Page:…
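For context, the common baseline recipe for attaching a new modality to a frozen LLM looks roughly like this (an illustrative sketch of the general approach, not X-Fusion's actual architecture): freeze the LLM, encode the image, and train only a projection into the LLM's embedding space.

```python
# Baseline recipe for adding a modality to a frozen LLM (illustrative, not X-Fusion's design).
import torch
import torch.nn as nn

class FrozenLLMWithVision(nn.Module):
    def __init__(self, llm: nn.Module, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.llm = llm
        self.vision_encoder = vision_encoder
        for p in self.llm.parameters():      # keep language capabilities and world knowledge intact
            p.requires_grad = False
        self.projector = nn.Linear(vision_dim, llm_dim)  # the only newly trained parameters

    def forward(self, image, text_embeds):
        # assume the encoder returns patch features of shape (B, N, vision_dim)
        vision_tokens = self.projector(self.vision_encoder(image))
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend visual tokens
        return self.llm(inputs_embeds=inputs)                    # HF-style kwarg, assumed here
```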
🔍 New paper: How do vision-language models actually align visual and language representations? We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens! Presented at XAI4CV Workshop @ CVPR 🧵 (1/6)
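For readers unfamiliar with the tool, a minimal sparse autoencoder of the kind typically used to probe hidden states looks like this (an illustrative sketch, not the paper's exact architecture): an overcomplete linear encoder with a ReLU and an L1 sparsity penalty, trained to reconstruct VLM activations.

```python
# Illustrative sparse autoencoder for probing activations (not the paper's exact setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete dictionary (d_dict >> d_model)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))            # sparse codes
        return self.decoder(z), z

def sae_loss(model, h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the codes."""
    h_hat, z = model(h)
    return ((h_hat - h) ** 2).mean() + l1_coeff * z.abs().mean()
```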
(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
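The core product-of-experts idea, stated as code (a generic sketch, not the paper's full framework): each expert assigns a log-probability to a candidate, and the composed distribution multiplies them, i.e. sums log-probabilities and renormalizes, so a candidate must be plausible under every expert.

```python
# Generic product-of-experts scoring over a discrete set of candidates (illustrative).
import math

def product_of_experts(candidates, experts):
    """experts: list of callables mapping a candidate to a log-probability."""
    log_scores = [sum(expert(c) for expert in experts) for c in candidates]
    # normalize with log-sum-exp for numerical stability
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - log_z) for s in log_scores]
```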
We present our paper "Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models". Check out more details: arXiv: arxiv.org/abs/2506.01320 Website: psi-sampler.github.io
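As background on the SMC side (a generic sketch under standard assumptions, not Ψ-Sampler itself): particles are reweighted by an exponentiated reward and resampled, so inference-time compute is concentrated on high-reward samples.

```python
# Generic reward-weighted resampling step used in SMC-style inference (illustrative).
import numpy as np

def resample_by_reward(particles: np.ndarray, rewards: np.ndarray, temperature: float = 1.0):
    """Reweight particles by exp(reward / temperature) and resample with replacement."""
    logits = rewards / temperature
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    weights /= weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```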
🧠 More Thinking, Less Seeing? 👀 Exploring the Balance Between Reasoning and Hallucination in Multimodal Reasoning Models! Currently, many multimodal reasoning models, while striving for enhanced reasoning capabilities, often neglect the issue of visual hallucinations. While…
How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
Elevate Visual-Spatial Intelligence with Spatial-MLLM! 🚀🚀🚀 Discover how we incorporate 3D information to help MLLMs better think in space in our work: Spatial-MLLM. 🔗Code: github.com/diankun-wu/Spa… 🌐Project Page: diankun-wu.github.io/Spatial-MLLM/ 📄Paper: arxiv.org/abs/2505.23747
🔥🔥 Introducing VLM-3R: Vision-Language Models with Instruction-Aligned 3D Reconstruction 📡 Monocular videos are everywhere, yet current VLMs struggle to extract deep 🛰️ Spatial Intelligence from them. Existing methods often rely…
Open-sourcing nanoDiT -- an educational repository to show rectified-flow training of class-conditional DiTs for image generation (~600 LoC). Hope that helps: github.com/sayakpaul/nano…
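For readers new to rectified flow, the training objective reduces to a few lines (a generic sketch, not nanoDiT's code): interpolate linearly between data and noise, and regress the model's predicted velocity onto `noise - x0`. The model call signature below is an assumption for illustration.

```python
# Generic rectified-flow training loss for a class-conditional model (illustrative).
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, class_labels):
    """x0: clean images (B, C, H, W); model predicts velocity given x_t, t, and class labels."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # uniform timesteps in [0, 1]
    noise = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * noise                       # linear interpolation path
    target_velocity = noise - x0                           # constant velocity along the path
    pred_velocity = model(x_t, t.view(b), class_labels)    # hypothetical signature
    return F.mse_loss(pred_velocity, target_velocity)
```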
It’s kind of mindblowing how good Veo 3 is at modeling intuitive physics. Our world models are getting pretty good, & in my view this has important implications regarding the computational complexity of the world - the last line of my bio for me has always been the ultimate quest ⬆️
Prompt Theory (Made with Veo 3) What if AI-generated characters refused to believe they were AI-generated?
🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement…
Before o3 impressed everyone with 🔥visual reasoning🔥, we already had faith in and were exploring models that can think with images. 🚀 Here’s our shot, GRIT: Grounded Reasoning with Images & Texts that trains MLLMs to think while performing visual grounding. It is done via RL…
🔥 News - 2nd Unlearning and Model Editing Workshop and Challenge at #ICCV2025 📃 Call for papers ready and OpenReview accepting submissions: bit.ly/4knWGv2 🧩 New challenge on #Unlearning bit.ly/43GkzbK / unlearning.iab-rubric.org Best performers in paper!
🚀Let’s Think Only with Images. No language and No verbal thought.🤔 Let’s think through a sequence of images💭, like how humans picture steps in their minds🎨. We propose Visual Planning, a novel reasoning paradigm that enables models to reason purely through images.
This update brings a completely new set of creative capabilities and improvements to References. An interesting emergent property is the ability of the model to precisely place objects in your scene using a layout you can provide. If you find new use cases, please share them.
We've released an update to Gen-4 References that brings marked improvements to aesthetic quality, scene composition and identity preservation. Alongside this update comes a number of exciting new use cases which we'll be sharing more about in the coming days. Gen-4 References…
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: danielchyeh.github.io/All-Angles-Ben…