Alex Trevithick
@alextrevith
Research Scientist @NVIDIAAI. PhD @UCSanDiego. 4D Vision, Machine Learning, Generative Models.
🚀 Introducing SimVS: our new method that simplifies 3D capture! 🎯 3D reconstruction assumes consistency—no dynamics or lighting changes—but reality constantly breaks this assumption. ✨ SimVS takes a set of inconsistent images and makes them consistent with a chosen frame.
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models @ChrisWu6080 @RuiqiGao @poolio @alextrevith ChangxiZheng @jon_barron @holynski_
Poster #60 this afternoon, swing by!
Interactive looong-context reasoning still has a long way to go. We need progress across all axes: more data, bigger models, and smarter architectures. ∞-THOR is just the beginning: generate ∞-len trajectories, run agents online, train with feedback, and more! Let's push the limits🚀
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? ∞-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
Supervised learning has held 3D Vision back for too long. Meet RayZer — a self-supervised 3D model trained with zero 3D labels: ❌ No supervision of camera & geometry ✅ Just RGB images And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy)…
What's the difference between the oai and google image generators? Giving both of them the same image and the prompt "generate this image", Gemini is essentially the identity function, whereas oai changes the content. Does this indicate a continuous encoder for Gemini vs. a VQVAE for oai?
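A minimal sketch of that identity test, assuming hypothetical local files (input.png for the shared input, gemini_out.png and oai_out.png for the two outputs); a high PSNR against the input means the generator is acting close to the identity function.

```python
# Minimal sketch of the identity-function test above. File names are
# hypothetical: the shared input image plus each model's output for the
# prompt "generate this image", compared at a common resolution.
import numpy as np
from PIL import Image

def load(path, size=(512, 512)):
    """Load an image, resize it, and normalize to floats in [0, 1]."""
    return np.asarray(Image.open(path).convert("RGB").resize(size),
                      dtype=np.float64) / 255.0

def psnr(a, b):
    """Peak signal-to-noise ratio; higher = closer to the identity map."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else -10.0 * np.log10(mse)

src = load("input.png")  # hypothetical shared input
for out in ("gemini_out.png", "oai_out.png"):  # hypothetical model outputs
    print(f"{out}: PSNR vs. input = {psnr(src, load(out)):.2f} dB")
```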
🦣Easi3R: 4D Reconstruction Without Training! Limited 4D datasets? Take it easy. #Easi3R adapts #DUSt3R for 4D reconstruction by disentangling and repurposing its attention maps → making 4D reconstruction easier than ever! 🔗Page: easi3r.github.io
⚡️ Introducing Bolt3D ⚡️ Bolt3D generates interactive 3D scenes in less than 7 seconds on a single GPU from one or more images. It features a latent diffusion model that *directly* generates 3D Gaussians of seen and unseen regions, without any test time optimization. 🧵👇 (1/9)
Thanks @_akhaliq for sharing our ReCamMaster! ReCamMaster can re-capture existing videos with novel camera trajectories. Project page: jianhongbai.github.io/ReCamMaster/ Paper: huggingface.co/papers/2503.11…
ReCamMaster: Camera-Controlled Generative Rendering from a Single Video
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense…
As one of the people who popularized the field of diffusion models, I am excited to share something that might be the “beginning of the end” of it. IMM has a single stable training stage, a single objective, and a single network — all are what make diffusion so popular today.
Today, we release Inductive Moment Matching (IMM): a new pre-training paradigm breaking the algorithmic ceiling of diffusion models. Higher sample quality. 10x more efficient. Single-stage, single network, stable training. Read more: lumalabs.ai/news/imm
I just pushed a new paper to arXiv. I realized that a lot of my previous work on robust losses and nerf-y things was dancing around something simpler: a slight tweak to the classic Box-Cox power transform that makes it much more useful and stable. It's this f(x, λ) here:
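For reference (the tweaked f(x, λ) itself is in the tweet's attached image and the paper, not reproduced here), the classic Box-Cox power transform being tweaked is:

```latex
% Classic Box-Cox power transform, for context only; the paper's tweaked
% f(x, \lambda) is in the attached image, not shown here. Requires amsmath.
f(x, \lambda) =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt]
  \ln x, & \lambda = 0.
\end{cases}
```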
The raw chain of thought from DeepSeek is fascinating, really reads like a human thinking out loud. Charming and strange.
Preprint of the day: Asim et al., "MEt3R: Measuring Multi-View Consistency in Generated Images" -- geometric-rl.mpi-inf.mpg.de/met3r/ Lots of diffusion-based solutions for novel-view synthesis recently, but how good are they? A metric to compare how "3D" they truly are.
Excited to finally share this work w/ @SuryaGanguli. TL;DR: we find the first closed-form analytical theory that replicates the outputs of the very simplest diffusion models, with median pixel-wise r^2 values of 90%+. arxiv.org/abs/2412.20292
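A minimal sketch of how that headline number could be computed, under my assumption (not the paper's stated procedure) that r^2 is taken pixel-wise per image and the median is reported across images; the arrays here are random stand-ins, not real model outputs.

```python
# Sketch of the metric: per-image pixel-wise r^2 between an analytical
# prediction and the actual diffusion output, then the median over images.
# Arrays below are random stand-ins, not real data.
import numpy as np

def pixelwise_r2(pred, actual):
    """Coefficient of determination over one image's pixel values."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
actual = rng.random((8, 32 * 32))                          # "model outputs"
pred = actual + 0.05 * rng.standard_normal(actual.shape)   # "theory"
scores = [pixelwise_r2(p, a) for p, a in zip(pred, actual)]
print(f"median pixel-wise r^2: {np.median(scores):.3f}")
```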
Training-free Video Enhancement: Achieved 🎉 Nice work with @oahzxl @shaowenqi126301 @VictorKaiWang1 @VitaGroupUT @YangYou1991 et al. Non-trivial enhancement, training-free, and plug-and-play 🥳 Blog: oahzxl.github.io/Enhance_A_Vide… (🧵1/6)
You know Generative 3D is moving fast when "early methods" were arXived 8 months ago 😂 [41] Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion." arXiv:2404.07199, April 10, 2024.
Introducing MASt3R-SLAM, the first real-time monocular dense SLAM with MASt3R as a foundation. Easy to use like DUSt3R/MASt3R, from an uncalibrated RGB video it recovers accurate, globally consistent poses & a dense map. With @eric_dexheimer*, @AjdDavison (*Equal Contribution)