Phillip (Yuseung) Lee
@yuseungleee
PhD student @kaist_ai
❗️Vision-Language Models (VLMs) struggle with even basic perspective changes! ✏️ In our new preprint, we aim to extend the spatial reasoning capabilities of VLMs to ⭐️arbitrary⭐️ perspectives. 📄Paper: arxiv.org/abs/2504.17207 🔗Project: apc-vlm.github.io 🧵[1/N]
Hidden in plain sight: VLMs overlook their visual representations "Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance." "VLMs are…
VLMs often struggle with physical reasoning tasks such as spatial reasoning. Excited to share how we can use world models + test-time search to zero-shot improve spatial reasoning in VLMs!
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
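As context for the idea above, here is a minimal sketch of test-time search with a world model: imagine several candidate camera moves, let the VLM score the imagined views, and answer from the most informative one. The API names (`world_model.imagine`, `vlm.score`, `vlm.answer`) are hypothetical placeholders, not the MindJourney interface.

```python
# Hedged sketch of best-of-N test-time search with a world model (hypothetical API).
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    action: str            # e.g., "turn_left_30", "move_forward_1m"
    imagined_view: object  # image generated by the world model
    score: float           # VLM's confidence that this view answers the question

def search_with_world_model(image, question, world_model, vlm, actions: List[str]) -> str:
    """Zero-shot test-time search: imagine views, score them, answer from the best one."""
    candidates = []
    for action in actions:
        view = world_model.imagine(image, action)   # hypothetical call
        score = vlm.score(view, question)           # hypothetical call
        candidates.append(Candidate(action, view, score))
    best = max(candidates, key=lambda c: c.score)
    return vlm.answer(best.imagined_view, question) # hypothetical call
```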
What exactly is a "world model"? And what limits existing video generation models from being true world models? In my new blog post, I argue that a true video world model must be causal, interactive, persistent, real-time, and physically accurate. xunhuang.me/blogs/world_mo…
Can VLMs build Spatial Mental Models like humans? Reasoning from limited views? Reasoning from partial observations? Reasoning about unseen objects behind furniture / beyond current view? Check out MindCube! 🌐mll-lab-nu.github.io/mind-cube/ 📰arxiv.org/pdf/2506.21458…
#ICCV2025 Introducing X-Fusion: Introducing New Modality to Frozen Large Language Models It is a novel framework that adapts pretrained LLMs (e.g., LLaMA) to new modalities (e.g., vision) while retaining their language capabilities and world knowledge! (1/n) Project Page:…
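For context, the common baseline recipe for attaching a new modality to a frozen LLM looks roughly like this (an illustrative sketch of the general approach, not X-Fusion's actual architecture): freeze the LLM, encode the image, and train only a projection into the LLM's embedding space.

```python
# Baseline recipe for adding a modality to a frozen LLM (illustrative, not X-Fusion's design).
import torch
import torch.nn as nn

class FrozenLLMWithVision(nn.Module):
    def __init__(self, llm: nn.Module, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.llm = llm
        self.vision_encoder = vision_encoder
        for p in self.llm.parameters():      # keep language capabilities and world knowledge intact
            p.requires_grad = False
        self.projector = nn.Linear(vision_dim, llm_dim)  # the only newly trained parameters

    def forward(self, image, text_embeds):
        # assume the encoder returns patch features of shape (B, N, vision_dim)
        vision_tokens = self.projector(self.vision_encoder(image))
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)  # prepend visual tokens
        return self.llm(inputs_embeds=inputs)                    # HF-style kwarg, assumed here
```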
🔍 New paper: How do vision-language models actually align visual and language representations? We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens! Presented at XAI4CV Workshop @ CVPR 🧵 (1/6)
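For readers unfamiliar with the tool, a minimal sparse autoencoder of the kind typically used to probe hidden states looks like this (an illustrative sketch, not the paper's exact architecture): an overcomplete linear encoder with a ReLU and an L1 sparsity penalty, trained to reconstruct VLM activations.

```python
# Illustrative sparse autoencoder for probing activations (not the paper's exact setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete dictionary (d_dict >> d_model)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))            # sparse codes
        return self.decoder(z), z

def sae_loss(model, h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the codes."""
    h_hat, z = model(h)
    return ((h_hat - h) ** 2).mean() + l1_coeff * z.abs().mean()
```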
(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
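The core product-of-experts idea, stated as code (a generic sketch, not the paper's full framework): each expert assigns a log-probability to a candidate, and the composed distribution multiplies them, i.e. sums log-probabilities and renormalizes, so a candidate must be plausible under every expert.

```python
# Generic product-of-experts scoring over a discrete set of candidates (illustrative).
import math

def product_of_experts(candidates, experts):
    """experts: list of callables mapping a candidate to a log-probability."""
    log_scores = [sum(expert(c) for expert in experts) for c in candidates]
    # normalize with log-sum-exp for numerical stability
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - log_z) for s in log_scores]
```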
We present our paper "Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models". Check out more details: arXiv: arxiv.org/abs/2506.01320 Website: psi-sampler.github.io
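As background on the SMC side (a generic sketch under standard assumptions, not Ψ-Sampler itself): particles are reweighted by an exponentiated reward and resampled, so inference-time compute is concentrated on high-reward samples.

```python
# Generic reward-weighted resampling step used in SMC-style inference (illustrative).
import numpy as np

def resample_by_reward(particles: np.ndarray, rewards: np.ndarray, temperature: float = 1.0):
    """Reweight particles by exp(reward / temperature) and resample with replacement."""
    logits = rewards / temperature
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    weights /= weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```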
🧠 More Thinking, Less Seeing? 👀 Exploring the Balance Between Reasoning and Hallucination in Multimodal Reasoning Models! Currently, many multimodal reasoning models, while striving for enhanced reasoning capabilities, often neglect the issue of visual hallucinations. While…
How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
Elevate Visual-Spatial Intelligence with Spatial-MLLM! 🚀🚀🚀 Discover how we incorporate 3D information to help MLLMs better think in space in our work: Spatial-MLLM. 🔗Code: github.com/diankun-wu/Spa… 🌐Project Page: diankun-wu.github.io/Spatial-MLLM/ 📄Paper: arxiv.org/abs/2505.23747
🔥🔥 Introducing VLM-3R: Vision-Language Models with Instruction-Aligned 3D Reconstruction 📡 Monocular videos are everywhere, yet current VLMs struggle to extract deep 🛰️ Spatial Intelligence from them. Existing methods often rely…
Open-sourcing nanoDiT -- an educational repository to show rectified-flow training of class-conditional DiTs for image generation (~600 LoC). Hope that helps: github.com/sayakpaul/nano…
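For readers new to rectified flow, the training objective reduces to a few lines (a generic sketch, not nanoDiT's code): interpolate linearly between data and noise, and regress the model's predicted velocity onto `noise - x0`. The model call signature below is an assumption for illustration.

```python
# Generic rectified-flow training loss for a class-conditional model (illustrative).
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, class_labels):
    """x0: clean images (B, C, H, W); model predicts velocity given x_t, t, and class labels."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # uniform timesteps in [0, 1]
    noise = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * noise                       # linear interpolation path
    target_velocity = noise - x0                           # constant velocity along the path
    pred_velocity = model(x_t, t.view(b), class_labels)    # hypothetical signature
    return F.mse_loss(pred_velocity, target_velocity)
```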
It’s kind of mindblowing how good Veo 3 is at modeling intuitive physics. Our world models are getting pretty good, & in my view this has important implications regarding the computational complexity of the world - the last line of my bio for me has always been the ultimate quest ⬆️
Prompt Theory (Made with Veo 3) What if AI-generated characters refused to believe they were AI-generated?
🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement…
Before o3 impressed everyone with 🔥visual reasoning🔥, we already had faith in and were exploring models that can think with images. 🚀 Here’s our shot, GRIT: Grounded Reasoning with Images & Texts that trains MLLMs to think while performing visual grounding. It is done via RL…
🔥 News - 2nd Unlearning and Model Editing Workshop and Challenge at #ICCV2025 📃 Call for papers ready and OpenReview accepting submissions: bit.ly/4knWGv2 🧩 New challenge on #Unlearning bit.ly/43GkzbK / unlearning.iab-rubric.org Best performers in paper!
🚀Let’s Think Only with Images. No language and No verbal thought.🤔 Let’s think through a sequence of images💭, like how humans picture steps in their minds🎨. We propose Visual Planning, a novel reasoning paradigm that enables models to reason purely through images.
This update brings a completely new set of creative capabilities and improvements to References. An interesting emergent property is the ability of the model to precisely place objects in your scene using a layout you can provide. If you find new use cases, please share them.
We've released an update to Gen-4 References that brings marked improvements to aesthetic quality, scene composition and identity preservation. Alongside this update comes a number of exciting new use cases which we'll be sharing more about in the coming days. Gen-4 References…
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: danielchyeh.github.io/All-Angles-Ben…