Jason Y. Zhang
@jasonyzhang2
3D @ Google. PhD @CMU_robotics.
Video, meet audio. 🎥🤝🔊 With Veo 3, our new state-of-the-art generative video model, you can add soundtracks to clips you make. Create talking characters, include sound effects, and more while developing videos in a range of cinematic styles. 🧵
For everyone interested in precise 📷 camera control 📷 in transformers (e.g., video or world models): Stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: liruilong.cn/prope/
Bolt3D is accepted to @ICCVConference 🥳 see you in Hawaii!
⚡️ Introducing Bolt3D ⚡️ Bolt3D generates interactive 3D scenes in less than 7 seconds on a single GPU from one or more images. It features a latent diffusion model that *directly* generates 3D Gaussians of seen and unseen regions, without any test time optimization. 🧵👇 (1/9)
image ⇒ video ⇒ 3D/4D I'm super excited to build the next generation of models that understand and can imagine the world like we do at SpAItial with amazing people. Sounds fun? We are hiring! spaitial.ai
🚀🚀🚀Announcing our $13M funding round to build the next generation of AI: 𝐒𝐩𝐚𝐭𝐢𝐚𝐥 𝐅𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥𝐬 that can generate entire 3D environments anchored in space & time. 🚀🚀🚀 Interested? Join our world-class team: 🌍 spaitial.ai #GenAI #3DAI
Veo 3 is out! deepmind.google/models/veo/ This model is awesome! It now generates audio as well as video. I'm really impressed by the background audio and music, and the synchronization of sound effects to the video. Try it out using Flow! labs.google/flow/about
Reference-powered Veo lets you go for walks in the Himalayas with your dog!
Here's a nice "proof without words": The sum of the squares of several positive values can never be bigger than the square of their sum. This picture helps make sense of how ℓ₁ and ℓ₂ norms sparsify and regularize solutions (resp.). [1/n]
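For anyone who wants the one-line algebra behind the picture, here is a short sketch. The symbols x_i and n are my own notation, not from the post: expand the square of the sum and note that every cross term is positive.

```latex
\[
\Big(\sum_{i=1}^{n} x_i\Big)^{2}
  = \sum_{i=1}^{n} x_i^{2} + 2\sum_{1 \le i < j \le n} x_i x_j
  \;\ge\; \sum_{i=1}^{n} x_i^{2},
  \qquad x_i > 0 .
\]
% Taking square roots (and applying the same bound to |x_i|) gives the norm inequality:
\[
\|x\|_2 = \Big(\sum_{i=1}^{n} x_i^{2}\Big)^{1/2}
  \;\le\; \sum_{i=1}^{n} |x_i| = \|x\|_1 .
\]
```

The two norms agree exactly when at most one entry is nonzero, since that is when every cross term vanishes. This is one way to see why ℓ₁ and ℓ₂ treat sparse vectors so differently.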
Excited to announce ReCogLab, our latest #ICLR work on long-context/relational reasoning evaluation for LLMs! openreview.net/pdf?id=yORSk4Y… github.com/google-deepmin… Work with Andrew Liu, @priorupdates @gargi_balasu @neuro_kim and others at @GoogleDeepMind
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense…