Rahul Venkatesh
@Rahul_Venkatesh
CS Ph.D. student at Stanford @NeuroAILab @StanfordAILab
AI models segment scenes based on how things appear, but babies segment based on what moves together. We use a visual world model our lab has been developing to capture this idea, and what's cool is that it beats SOTA models on zero-shot segmentation and physical…
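A minimal sketch of the counterfactual-motion idea, assuming a hypothetical world-model interface `predict_flow` (not the paper's actual API): virtually "poke" one pixel, ask the model which pixels it predicts will move along with it, and group the pixels that reliably co-move.

```python
import numpy as np

def counterfactual_segment(world_model, frame, seed_pixel, n_pokes=8, thresh=0.5):
    """Group pixels the world model predicts move together when seed_pixel
    is counterfactually 'poked'. world_model.predict_flow is an assumed
    interface, not the paper's implementation."""
    H, W = frame.shape[:2]
    votes = np.zeros((H, W))
    for _ in range(n_pokes):
        # Random virtual motion applied at the seed pixel.
        poke = np.random.randn(2)
        # Assumed API: returns predicted per-pixel motion (H, W, 2)
        # under the counterfactual poke.
        flow = world_model.predict_flow(frame, at=seed_pixel, motion=poke)
        moved = np.linalg.norm(flow, axis=-1) > 1e-3
        votes += moved
    # Pixels that consistently move with the seed form one object mask.
    return (votes / n_pokes) > thresh
```

Averaging over several random pokes is one plausible way to make the grouping robust to any single unlucky motion direction.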
Very interesting... a way of thinking about scene segmentation that is both more functional for robotics and category-agnostic (more cognitively grounded, since babies get few category labels), and new models!
Amazing to see all the things @NeuroAILab is doing with counterfactuals + a single "pure" vision foundation model, LRAS. Self-supervised segmentation is my favorite. It gets at a deep philosophical question: what is an object, anyway?
Here's a third application of our new world modeling technology: object grouping. In a sense this completes the video scene understanding trifecta of 3D shape, motion, and now object individuation. From a technical perspective, the core innovation is the idea of…
what are objects, though? seriously, if i ask you to define where one object begins and another one ends, would you have a good answer? is my phone case part of my phone? is my shirt part of my body? maybe it is based on whether i can take it apart and put it back together?…
Over the past 18 months my lab has been developing a new approach to visual world modeling. A magnum opus that ties it all together will be out in the next couple of weeks. But for now, some individual application papers have poked out.
📷 New Preprint: SOTA optical flow extraction from pre-trained generative video models! While it seems intuitive that video models grasp optical flow, extracting that understanding has proven surprisingly elusive.
We prompt a generative video model to extract state-of-the-art optical flow, using zero labels and no fine-tuning. Our method, KL-tracing, achieves SOTA results on TAP-Vid & generalizes to challenging YouTube clips. @khai_loong_aw @KlemenKotar @CristbalEyzagu2 @lee_wanhee_…
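One way to picture KL-tracing, as a rough sketch under assumed interfaces rather than the paper's actual code: inject a tiny perturbation at the query pixel in frame 1, compare the model's next-frame token distributions with and without it, and read the point's destination off the location where the KL divergence spikes.

```python
import torch
import torch.nn.functional as F

def kl_trace(model, frame1, frame2, query_xy, delta=0.2):
    """Sketch of the KL-tracing idea. `model(frame1, frame2)` is an
    assumed interface returning per-location logits over next-frame
    tokens, shape (H, W, vocab); not the released implementation."""
    clean_logits = model(frame1, frame2)
    x, y = query_xy
    perturbed = frame1.clone()
    perturbed[y, x] += delta                    # small counterfactual edit
    pert_logits = model(perturbed, frame2)
    # Per-location KL divergence between perturbed and clean predictions.
    kl = F.kl_div(
        F.log_softmax(pert_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="none",
    ).sum(-1)                                   # (H, W)
    # The KL spike marks where the query point landed in frame 2.
    dest = torch.nonzero(kl == kl.max())[0]
    dx, dy = dest[1].item() - x, dest[0].item() - y
    return dx, dy                               # estimated flow at query_xy
```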
🚀 Excited to share our new paper! We introduce the first autoregressive model that natively handles: 🎥 Novel view synthesis 🎨 Interactive 3D object editing 📏 Depth extraction ➕ and more! No fine-tuning needed—just prompting. Outperforming even diffusion-based methods!
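For intuition, a hedged sketch of the "just prompting" pattern for an autoregressive vision model; every name below (`tokenize`, `task_token`, `generate`, `detokenize`) is hypothetical, not LRAS's real API: tokenize the input image, append a task token, and decode the answer as more tokens.

```python
import torch

def prompt_for_depth(lras, image):
    """Hypothetical illustration of task prompting: the model is asked
    for depth purely by conditioning, with no fine-tuning."""
    rgb_tokens = lras.tokenize(image)                  # (N,) discrete codes
    prompt = torch.cat([rgb_tokens, lras.task_token("depth")])
    depth_tokens = lras.generate(prompt, max_new_tokens=rgb_tokens.numel())
    return lras.detokenize(depth_tokens, kind="depth") # (H, W) depth map
```

Swapping the task token would, under the same assumed interface, steer the same frozen model toward novel view synthesis or editing instead.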