Wenlong Huang
@wenlong_huang
PhD Student @StanfordSVL @StanfordAILab. Previously @Berkeley_AI @GoogleDeepMind. Robotics, Foundation Models.
What structural task representation enables multi-stage, in-the-wild, bimanual, reactive manipulation? Introducing ReKep: LVM to label keypoints & VLM to write keypoint-based constraints, solve w/ optimization for diverse tasks, w/o task-specific training or env models. 🧵👇
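To make the ReKep recipe concrete, here is a minimal sketch of the idea in Python: a constraint is just a small function over labeled 3D keypoints, and a generic optimizer finds an end-effector subgoal that drives its cost to zero. The keypoints, the pouring constraint, and the L-BFGS solve below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): a VLM-style constraint written over
# tracked 3D keypoints, solved with a generic optimizer.
import numpy as np
from scipy.optimize import minimize

# Pretend these were labeled by a large vision model on the current RGB-D frame.
keypoints = {
    "cup_rim": np.array([0.45, 0.10, 0.20]),
    "bottle_mouth": np.array([0.60, -0.05, 0.35]),
}

def pour_alignment_cost(ee_pos: np.ndarray) -> float:
    """Illustrative constraint: the bottle mouth (rigidly attached to the
    gripper) should sit 5 cm above the cup rim before tilting."""
    target = keypoints["cup_rim"] + np.array([0.0, 0.0, 0.05])
    return float(np.linalg.norm(ee_pos - target))

def solve_subgoal(cost_fn, x0):
    """Find an end-effector position that satisfies the written constraint."""
    res = minimize(cost_fn, x0, method="L-BFGS-B")
    return res.x

if __name__ == "__main__":
    ee_start = keypoints["bottle_mouth"]   # gripper currently at the bottle mouth
    ee_goal = solve_subgoal(pour_alignment_cost, ee_start)
    print("next end-effector subgoal:", ee_goal)
```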
🚀 Introducing RIGVid: Robots Imitating Generated Videos! Robots can now perform complex tasks—pouring, wiping, mixing—just by imitating generated videos, purely zero-shot! No teleop. No OpenX/DROID/Ego4D. No videos of human demonstrations. Only AI-generated video demos 🧵👇
🚨 The era of infinite internet data is ending, so we ask: 👉 What’s the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train autoregressive models. ▶️ Data-constrained? Train diffusion models. Get ready for 🤿 1/n
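One commonly cited intuition behind that TL;DR, shown as a toy sketch: when the same sequence must be reused for many epochs, a masked/diffusion-style objective sees a fresh corruption on every pass, while next-token targets are identical each time. The token IDs and mask rate below are made up purely for illustration, not taken from the paper.

```python
# Toy illustration: repeated epochs look different to a masked-diffusion
# objective (fresh random corruption each pass) but identical to next-token
# prediction (same input/target pairs every epoch).
import random

sequence = [5, 9, 2, 7, 3, 8]   # one "document" we must reuse for many epochs

def ar_targets(seq):
    # Autoregressive: (context token, next token) pairs never change.
    return list(zip(seq[:-1], seq[1:]))

def masked_diffusion_view(seq, mask_rate=0.5, mask_id=-1):
    # Masked-diffusion style: a new random corruption of the same sequence.
    return [mask_id if random.random() < mask_rate else tok for tok in seq]

for epoch in range(3):
    print("AR targets      :", ar_targets(sequence))
    print("diffusion input :", masked_diffusion_view(sequence))
```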
Excited that @RuohanZhang76 is joining NU @northwesterncs ! If you are thinking about pursuing a PhD, definitely reach out to him! During my wonderful year at @StanfordAILab @StanfordSVL, when I was completely new to robotics, he was the nicest person who was incredibly patient…
📢 Beginning this fall, four new tenure-track, clinical, and visiting faculty members will join our department! 📢 We are thrilled to welcome Shaddin Dughmi, Sidhanth Mohanty, Lydia Tse, and Ruohan Zhang! Meet the newest members of our team: spr.ly/6019fGdXv
Tactile interaction in the wild can unlock fine-grained manipulation! 🌿🤖✋ We built a portable handheld tactile gripper that enables large-scale visuo-tactile data collection in real-world settings. By pretraining on this data, we bridge vision and touch—allowing robots to:…
TRI's latest Large Behavior Model (LBM) paper landed on arxiv last night! Check out our project website: toyotaresearchinstitute.github.io/lbm1/ One of our main goals for this paper was to put out a very careful and thorough study on the topic to help people understand the state of the…
Exciting to see more works leveraging VLM-inferred keypoints as a bridge between semantic knowledge and low-level behaviors, especially those dexterous skills 🤩
We find keypoint trajectories to be a powerful interface between VLM planning & RL control. VLM: generates an object + hand motion plan from a task prompt & RGB-D image (perception + commonsense). RL policy: conditioned on the plan, learns low-level dexterous control (0-shot sim2real).
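A minimal sketch of what that interface might look like in code: the VLM's output is just arrays of 3D waypoints, and the low-level policy's observation stitches the current tracked keypoints together with the next waypoint in the plan. The class, shapes, and dummy data below are illustrative assumptions, not the paper's actual code.

```python
# Sketch of keypoint trajectories as the interface between a VLM planner and
# a low-level RL policy (assumed structure, for illustration only).
from dataclasses import dataclass
import numpy as np

@dataclass
class KeypointPlan:
    object_traj: np.ndarray   # (T, K_obj, 3) waypoints emitted by the VLM
    hand_traj: np.ndarray     # (T, K_hand, 3)

def policy_observation(plan: KeypointPlan, t: int, tracked_obj, tracked_hand):
    """Concatenate the currently tracked keypoints with the next planned waypoints."""
    nxt = min(t + 1, plan.object_traj.shape[0] - 1)
    return np.concatenate([
        tracked_obj.ravel(), tracked_hand.ravel(),
        plan.object_traj[nxt].ravel(), plan.hand_traj[nxt].ravel(),
    ])

# Usage with dummy data: 2 object keypoints, 3 hand keypoints, a 10-step plan.
plan = KeypointPlan(np.zeros((10, 2, 3)), np.zeros((10, 3, 3)))
obs = policy_observation(plan, t=0,
                         tracked_obj=np.zeros((2, 3)),
                         tracked_hand=np.zeros((3, 3)))
print(obs.shape)
```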
Can VLMs build Spatial Mental Models like humans? Reasoning from limited views? Reasoning from partial observations? Reasoning about unseen objects behind furniture / beyond current view? Check out MindCube! 🌐mll-lab-nu.github.io/mind-cube/ 📰arxiv.org/pdf/2506.21458…
“As a PhD student, your job is not publishing a paper every quarter. Focus on one problem, understand it deeply, and solve it over years under the protection of your adviser” from @RussTedrake #RSS2025
Tesla Robotaxi: A New Era Begins I’ve (very fortunately) been part of multiple robotaxi launches. But this one is different and feels much more profound. It’s a paradigm shift. It’s the GPT moment for real-world autonomy. Tesla’s robotaxi runs vision-only -- no lidar, no radar,…
The future of transportation is here with Tesla robotaxi
Attending RSS for the first time and giving a talk tomorrow at the Learning Structured World Models for Robotic Manipulation workshop! At midnight, I made a last-minute crazy decision to change my talk content to Virtual Community — to honor the incredible hard work of my…
World Simulator, reimagined — now alive with humans, robots, and their vibrant society unfolding in 3D real-world geospatial scenes across the globe! 🚀 One day soon, humans and robots will co-exist in the same world. To prepare, we must address: 1️⃣ How can robots cooperate or…
Join us tomorrow in SGM 124 for the SWOMO workshop at #RSS2025! We will have 6 amazing talks and a closing panel to discuss structured world modeling for robotics! Latest schedule and information at swomo-rss.github.io
Excited to announce the “Structured World Models for Robotic Manipulation” workshop at #RSS2025 in LA! Website: swomo-rss.github.io Call for Papers (Deadline: May 23): swomo-rss.github.io/index.html#call Come join us with a stellar lineup of speakers to discuss the various important &…
Can we learn a 3D world model that predicts object dynamics directly from videos? Introducing Particle-Grid Neural Dynamics: a learning-based simulator for deformable objects that trains from real-world videos. Website: kywind.github.io/pgnd ArXiv: arxiv.org/abs/2506.15680…
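A hedged sketch of the hybrid particle-grid idea named in the title: particle states are scattered onto a regular grid, a learned update runs on the grid, and the result is gathered back to advance the particles. The nearest-cell scatter and the linear stand-in for the learned grid network below are assumptions for illustration, not the released code.

```python
# Sketch of one particle-grid dynamics step: particles -> grid -> learned
# update (stubbed) -> particles. Shapes and resolution are illustrative.
import numpy as np

def scatter_to_grid(pos, feat, grid_res=16):
    """Average particle features into grid cells (nearest-cell assignment)."""
    grid = np.zeros((grid_res,) * 3 + (feat.shape[1],))
    counts = np.zeros((grid_res,) * 3 + (1,))
    idx = np.clip((pos * grid_res).astype(int), 0, grid_res - 1)
    for (i, j, k), f in zip(idx, feat):
        grid[i, j, k] += f
        counts[i, j, k] += 1
    return grid / np.maximum(counts, 1)

def step(pos, vel, W, dt=0.02, grid_res=16):
    grid = scatter_to_grid(pos, vel, grid_res)   # particle -> grid
    grid_out = grid @ W                          # stand-in for a learned grid network
    idx = np.clip((pos * grid_res).astype(int), 0, grid_res - 1)
    new_vel = grid_out[idx[:, 0], idx[:, 1], idx[:, 2]]   # grid -> particle
    return pos + dt * new_vel, new_vel

pos = np.random.rand(200, 3)   # particles sampled from a tracked object
vel = np.zeros((200, 3))
W = np.eye(3)                  # placeholder for learned weights
pos, vel = step(pos, vel, W)
print(pos.shape, vel.shape)
```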
🤖 Do VLA models really listen to language instructions? Maybe not 👀 🚀 Introducing our RSS paper: CodeDiffuser -- using VLM-generated code to bridge the gap between **high-level language** and **low-level visuomotor policy** 🎮 Try the live demo: robopil.github.io/code-diffuser/ (1/9)
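A hedged sketch of the bridging idea from the tweet: instead of feeding raw language to the policy, a VLM writes a short program against simple perception primitives, and the program's output (here a 3D attention map over points) is what conditions the low-level policy. The helpers detect, attention_map, and the commented-out diffusion_policy call are hypothetical names for illustration, not CodeDiffuser's actual API.

```python
# Sketch: VLM-generated code turns an ambiguous instruction into a spatial
# attention map that conditions a visuomotor policy (illustrative only).
import numpy as np

def detect(obs, name):
    """Stand-in open-vocabulary detector returning a 3D centroid for `name`."""
    return np.random.rand(3)

def attention_map(points, center, sigma=0.1):
    """Soft attention over a point cloud, peaked at the referenced object."""
    d = np.linalg.norm(points - center, axis=-1)
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

# --- what a VLM might emit for "pick up the mug next to the kettle" ---
def task_program(obs, points):
    mugs = [detect(obs, "mug"), detect(obs, "mug")]               # two candidate mugs
    kettle = detect(obs, "kettle")
    target = min(mugs, key=lambda m: np.linalg.norm(m - kettle))  # "next to the kettle"
    return attention_map(points, target)

# The attention map, not raw language, is what conditions the low-level policy:
points = np.random.rand(1024, 3)
attn = task_program(obs=None, points=points)
# action = diffusion_policy(obs, attn)   # hypothetical low-level policy call
print(attn.shape)
```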
Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust…
Today we're excited to share a glimpse of what we're building at Generalist. As a first step towards our mission of making general-purpose robots a reality, we're pushing the frontiers of what end-to-end AI models can achieve in the real world. Here's a preview of our early…
Today is the day! Come join the @CVPR workshop on Foundation Models meet Embodied Agents! 🗓️Jun 11 📍Room 214 🌐…models-meet-embodied-agents.github.io/cvpr2025/ Looking forward to learning insights from wonderful speakers @JitendraMalikCV @RanjayKrishna @KaterinaFragiad @ShuangL13799063 @du_yilun…
I always found it puzzling how language models learn so much from next-token prediction, while video models learn so little from next frame prediction. Maybe it's because LLMs are actually brain scanners in disguise. Idle musings in my new blog post: sergeylevine.substack.com/p/language-mod…
Very impressed with Veo 3 and all the things people are finding on r/aivideo etc. Makes a big difference qualitatively when you add audio. There are a few macro aspects to video generation that may not be fully appreciated: 1. Video is the highest bandwidth input to brain. Not…
It's been only a day since Google dropped Veo 3. The new model creates video and audio simultaneously from a single prompt! Here are 13 wild examples so far: 1. Self-aware AI characters
Language-conditioned policies are kind of boring until we have sensorimotor data that reaches even a fraction of language's diversity. Until then, language is just a one-hot task encoding.
Two days into #ICRA2025 @ieee_ras_icra—great connecting with folks! Gave a talk, moderated a panel, and got a *Best Paper Award* 🏆 at the workshops. Up next: four papers and two more workshop talks/panels. Excited to chat robot learning and the road to general intelligence! 🤖