Yang Zheng
@yang_zheng18
PhD student @Stanford
Can we reconstruct relightable human hair appearance from real-world visual observations? We introduce GroomLight, a hybrid inverse rendering method for relightable human hair appearance modeling. syntec-research.github.io/GroomLight/
AllTracker: Efficient Dense Point Tracking at High Resolution. If you're using any point tracker in any project, this is likely a drop-in upgrade—improving speed, accuracy, and density, all at once.
📢 Introducing DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models. Compared to vanilla DPO, we improve paired data construction and preference label granularity, leading to better visual quality and motion strength with only 1/3 of the data. 🧵
Most video models 🤯forget the past 🐌slow down over time 🔁rely on bidirectional (not causal) attention Our state-space video world models (SSMs) 🧠remember across hundreds of frames ⚡️generate at constant speed ⏩are fully causal, enabling real-time rollout 1/3
More cool videos🔥 and details available on our website: dsaurus.github.io/isa4d/
Video generation of humans with control over body pose and facial expressions is crucial for a plethora of applications. Towards this goal, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for DiT–based video generation models #SIGGRAPH2025
Curious about how cities have changed in the past decade? We use MLLMs to analyse 40 million Street View images to answer this. Did you know that "juice shops became a thing in NYC" and "miles of overpasses were painted BLUE in SF"? More at→boyangdeng.com/visual-chronic… (vid ↓ w/ 🔊)
Excited to share our work: Gaussian Mixture Flow Matching Models (GMFlow) github.com/lakonik/gmflow GMFlow generalizes diffusion models by predicting Gaussian mixture denoising distributions, enabling precise few-step sampling and high-quality generation.
Introducing CoT-VLA – Visual Chain-of-Thought reasoning for Robot Foundation Models! 🤖 By leveraging next-frame prediction as visual chain-of-thought reasoning, CoT-VLA uses future prediction to guide action generation and unlock large-scale video data for training. #CVPR2025
🔥Want to capture 3D dancing fluids♨️🌫️🌪️💦? No specialized equipment, just one video! Introducing FluidNexus: Now you only need one camera to reconstruct 3D fluid dynamics and predict future evolution! 🧵1/4 Web: yuegao.me/FluidNexus/ Arxiv: arxiv.org/pdf/2503.04720
🏡Building realistic 3D scenes just got smarter! Introducing our #CVPR2025 work, 🔥FirePlace, a framework that enables Multimodal LLMs to automatically generate realistic and geometrically valid placements for objects into complex 3D scenes. How does it work?🧵👇
Can robots leverage their entire body to sense and interact with their environment, rather than just relying on a centralized camera and end-effector? Introducing RoboPanoptes, a robot system that achieves whole-body dexterity through whole-body vision. robopanoptes.github.io
🤖 Introducing Human-Object Interaction from Human-Level Instructions! The first complete system that generates physically plausible, long-horizon human-object interactions with finger motions in contextual environments, driven by human-level instructions. 🔍 Our approach: - LLMs…
Do large multimodal models understand how to make dresses for your winter holiday party💃? We introduce AIpparel, a vision-language-garment model capable of generating and editing simulation-ready sewing patterns from text and images. Project page at georgenakayama.github.io/AIpparel/.…