Yixuan Wang
@YXWangBot
CS Ph.D. student @Columbia & Intern @AIatMeta | Prev. Boston Dynamics AI Institute, Google X #Vision #Robotics #Learning
🤔Active robot exploration is critical but hard – long horizons, large spaces, and complex occlusions. How can robots explore like humans? 🤖Introducing CuriousBot, which interactively explores and builds an actionable 3D relational object graph. 🔗curiousbot.theaiinstitute.com 👇Thread (1/9)
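A rough sketch of what an "actionable 3D relational object graph" could look like in code (hypothetical names and structure for illustration only, not the CuriousBot implementation): objects as nodes with 3D positions, spatial relations as edges, and candidate interactions attached to nodes so the robot can pick what to explore next.

```python
# Minimal sketch of an actionable 3D relational object graph.
# Hypothetical structure for illustration only -- not the CuriousBot code.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str
    position: tuple[float, float, float]               # 3D position of the object
    explored: bool = False                              # has the robot inspected it yet?
    actions: list[str] = field(default_factory=list)    # candidate interactions

@dataclass
class RelationEdge:
    subject: str   # e.g. "mug"
    relation: str  # e.g. "inside", "on_top_of", "occluded_by"
    target: str    # e.g. "drawer"

@dataclass
class SceneGraph:
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: list[RelationEdge] = field(default_factory=list)

    def frontier(self) -> list[ObjectNode]:
        """Objects that are still unexplored -- candidates for the next action."""
        return [n for n in self.nodes.values() if not n.explored]

# Example: a mug hidden inside a closed drawer suggests an "open" action.
g = SceneGraph()
g.nodes["drawer"] = ObjectNode("drawer", (0.4, 0.0, 0.2), actions=["open"])
g.nodes["mug"] = ObjectNode("mug", (0.4, 0.0, 0.25))
g.edges.append(RelationEdge("mug", "inside", "drawer"))
print([n.name for n in g.frontier()])
```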
🚀 Introducing RIGVid: Robots Imitating Generated Videos! Robots can now perform complex tasks—pouring, wiping, mixing—just by imitating generated videos, purely zero-shot! No teleop. No OpenX/DROID/Ego4D. No videos of human demonstrations. Only AI-generated video demos 🧵👇
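One plausible reading of such a pipeline, written as a toy sketch: generate a video of the task from the current scene image, track the manipulated object's motion in that video, and retarget the trajectory to the robot. All helper names below are placeholders, not the RIGVid API, and the real system may differ.

```python
# Hypothetical sketch of an "imitate a generated video" pipeline.
# Every function here is a stand-in stub, not the RIGVid implementation.
import numpy as np

def generate_task_video(prompt: str, first_frame: np.ndarray) -> np.ndarray:
    """Stand-in for a video generation model conditioned on the scene image."""
    return np.repeat(first_frame[None], 16, axis=0)  # dummy 16-frame "video"

def track_object_pose(video: np.ndarray) -> np.ndarray:
    """Stand-in for a 6-DoF object pose tracker run on the generated video."""
    return np.zeros((video.shape[0], 6))  # (T, 6): xyz + rotation per frame

def retarget_to_robot(object_poses: np.ndarray) -> np.ndarray:
    """Map the object's motion to end-effector waypoints for the robot."""
    return object_poses  # identity mapping in this toy sketch

scene = np.zeros((240, 320, 3), dtype=np.uint8)
video = generate_task_video("pour water from the cup into the bowl", scene)
waypoints = retarget_to_robot(track_object_pose(video))
print(waypoints.shape)  # (16, 6) -- one waypoint per generated frame
```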
I was really impressed by the UMI gripper (@chichengcc et al.), but a key limitation is that **force-related data wasn’t captured**: humans feel haptic feedback through the mechanical springs, but the robot couldn’t leverage that info, limiting the data’s value for fine-grained…
Tactile interaction in the wild can unlock fine-grained manipulation! 🌿🤖✋ We built a portable handheld tactile gripper that enables large-scale visuo-tactile data collection in real-world settings. By pretraining on this data, we bridge vision and touch—allowing robots to:…
It is soooooo awesome to see UMI + Tactile come to life! I am very impressed by how quickly the whole hardware + software system was built. Meanwhile, they even collected lots of data in the wild! Amazing work!!!
TRI's latest Large Behavior Model (LBM) paper landed on arxiv last night! Check out our project website: toyotaresearchinstitute.github.io/lbm1/ One of our main goals for this paper was to put out a very careful and thorough study on the topic to help people understand the state of the…
Had a great time yesterday giving three invited talks at #RSS2025 workshops—on foundation models, structured world models, and tactile sensing for robotic manipulation. Lots of engaging conversations! One more talk coming up on Wednesday (6/25). Also excited to be presenting two…
Just arrived in LA and excited to be at RSS! I will present CodeDiffuser at the following sessions: - Presentation on June 22 (Sun.) 5:30 PM - 6:30 PM - Poster on June 22 (Sun.) 6:30 PM - 8:00 PM I will also present CuriousBot at - FM4RoboPlan Workshop on June 21 (Sat.) 9:40 - 10:10…
🤖 Do VLA models really listen to language instructions? Maybe not 👀 🚀 Introducing our RSS paper: CodeDiffuser -- using VLM-generated code to bridge the gap between **high-level language** and **low-level visuomotor policy** 🎮 Try the live demo: robopil.github.io/code-diffuser/ (1/9)
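A hedged sketch of the "VLM-generated code bridges language and policy" idea: the VLM writes a small snippet that resolves an ambiguous instruction into concrete 3D regions of interest, and those regions condition the low-level policy. The helper names and interface here are assumptions for illustration, not CodeDiffuser's actual API.

```python
# Hedged sketch: VLM-generated code selects task-relevant 3D regions,
# which then condition a low-level visuomotor policy (stubbed here).
import numpy as np

# Pretend perception output: object detections with names and 3D centroids.
detections = [
    {"name": "mug", "center": np.array([0.30, -0.15, 0.05])},   # left mug
    {"name": "mug", "center": np.array([0.30,  0.20, 0.05])},   # right mug
    {"name": "branch", "center": np.array([0.55, 0.10, 0.30])},
]

# Code a VLM might emit for "Hang the left mug on the right branch":
# it turns the ambiguous instruction into a concrete object selection.
generated_code = """
left_mug = min([d for d in detections if d["name"] == "mug"],
               key=lambda d: d["center"][1])
targets = [left_mug["center"]]
"""

namespace = {"detections": detections}
exec(generated_code, namespace)            # run the VLM-written snippet
targets = namespace["targets"]             # task-relevant 3D points

def visuomotor_policy(point_cloud: np.ndarray, attention_points: list) -> np.ndarray:
    """Stub for the low-level policy conditioned on the selected 3D regions."""
    return np.zeros(7)  # e.g. a 7-DoF action

action = visuomotor_policy(np.zeros((1024, 3)), targets)
print(targets, action.shape)
```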
We’ve been exploring 3D world models with the goal of finding the right recipe that is both: (1) structured—for sample efficiency and generalization (my personal emphasis), and (2) scalable—as we increase real-world data collection. With **Particle-Grid Neural Dynamics**…
Can we learn a 3D world model that predicts object dynamics directly from videos? Introducing Particle-Grid Neural Dynamics: a learning-based simulator for deformable objects that trains from real-world videos. Website: kywind.github.io/pgnd ArXiv: arxiv.org/abs/2506.15680…
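The name suggests a hybrid particle/grid state representation; below is a toy sketch of what one rollout step of such a model could look like (particle-to-grid aggregation, grid-to-particle readout, a placeholder "learned" update). The actual Particle-Grid Neural Dynamics architecture may differ; this only illustrates the information flow.

```python
# Toy sketch of one rollout step of a particle-grid dynamics model.
# Not the paper's architecture -- only the particle -> grid -> particle flow.
import numpy as np

rng = np.random.default_rng(0)
N, CELL = 512, 0.05                          # number of particles, grid cell size
pos = rng.uniform(0, 0.5, size=(N, 3))       # particle positions (meters)
vel = rng.normal(scale=0.01, size=(N, 3))    # particle velocities

def particles_to_grid(pos, vel, cell):
    """Average particle velocities into the grid cells they fall in."""
    idx = np.floor(pos / cell).astype(int)
    keys = [tuple(i) for i in idx]
    grid = {}
    for k, v in zip(keys, vel):
        s, c = grid.get(k, (np.zeros(3), 0))
        grid[k] = (s + v, c + 1)
    return {k: s / c for k, (s, c) in grid.items()}, keys

def grid_to_particles(grid, keys):
    """Read the aggregated cell feature back at each particle."""
    return np.stack([grid[k] for k in keys])

# Placeholder for the learned update: a fixed random linear map over
# [particle velocity, aggregated neighborhood velocity].
W = rng.normal(scale=0.1, size=(6, 3))

grid, keys = particles_to_grid(pos, vel, CELL)
neighborhood = grid_to_particles(grid, keys)
delta_v = np.concatenate([vel, neighborhood], axis=1) @ W   # "network" output
vel = vel + delta_v
pos = pos + 0.02 * vel                                      # integrate at 50 Hz
print(pos.shape, vel.shape)
```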
How can we achieve both common-sense understanding that handles varying levels of ambiguity in language and dexterous manipulation? Check out CodeDiffuser, a really neat work that bridges Code Gen with a 3D Diffusion Policy! This was a fun project with cool experiments! 🤖
Check out the cool results and demo!
Two releases in a row from our lab today 😆 One problem I have always been pondering is how to use structured representations while keeping them scalable. Super excited that Kaifeng's work pushes this direction forward, and I cannot wait to see what comes next!!
**Steerability** remains one of the key issues for current vision-language-action models (VLAs). Natural language is often ambiguous and vague: "Hang a mug on a branch" vs "Hang the left mug on the right branch." Many works claim to handle language input, yet the tasks are often…
It is cool to see that you can steer your low-level policy with foundation models. Check out new work from @YXWangBot
Ep#10 with @RogerQiu_42 on Humanoid Policy ~ Human Policy human-as-robot.github.io Co-hosted by @chris_j_paxton & @micoolcho