Sangwoo Mo
@sangwoomo
Postdoc @UMich. Past: PhD @kaist_ai, Intern @AIatMeta, @NVIDIAAI. Work on scalable priors for vision, language, and robotics.
Can scaling data and models alone solve computer vision? 🤔 Join us at the SP4V Workshop at #ICCV2025 in Hawaii to explore this question! 🎤 Speakers: @danfei_xu, @joaocarreira, @jiajunwu_cs, Kristen Grauman, @sainingxie, @vincesitzmann 🔗 sp4v.github.io

Beyond excited to share FlowMo! We found that the latent representations of video models implicitly encode motion information and can guide the model toward coherent motion at inference time. Very proud of @ariel__shaulov and @itayhzn for this work! Plus, it’s open source! 🥳
🧵1/ Text-to-video models generate stunning visuals, but… motion? Not so much. You get extra limbs, objects popping in and out... In our new paper, we present FlowMo -- an inference-time method that reduces temporal artifacts without retraining or architectural changes. 👇
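For readers curious what an "inference-time method that reduces temporal artifacts without retraining" can look like in code, here is a rough, hypothetical sketch: a coherence penalty on the video latents whose gradient nudges each denoising step. The penalty, function names, and scale below are illustrative assumptions, not FlowMo's actual formulation (see the paper and open-source code for that).

```python
import torch

def temporal_coherence_loss(latents):
    # latents: (T, C, H, W) video latents; penalize large frame-to-frame
    # changes as a crude, illustrative proxy for incoherent motion.
    diffs = latents[1:] - latents[:-1]
    return diffs.pow(2).mean()

def guided_denoise_step(denoiser, latents, t, guidance_scale=0.1):
    # One denoising step with an extra inference-time gradient nudge toward
    # temporally coherent latents; the base model itself is left untouched.
    latents = latents.detach().requires_grad_(True)
    noise_pred = denoiser(latents, t)  # standard denoiser call (assumed API)
    grad = torch.autograd.grad(temporal_coherence_loss(latents), latents)[0]
    return noise_pred + guidance_scale * grad
```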
Excited to speak at the SP4V Workshop! I think this topic is actually quite nuanced, and I'm looking forward to sharing our group's experience in trying to learn geometry and structure from data!
Our computer vision textbook is now available for free online: visionbook.mit.edu. We are working on adding some interactive components like search and (beta) integration with LLMs. Hope this is useful, and feel free to submit GitHub issues to help us improve the text!
Q-learning is not yet scalable: seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
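For context on what is being scaled, this is the textbook tabular Q-learning update; the blog's argument concerns how the bootstrapped max target behaves with deep networks and long horizons, not this toy form.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Standard off-policy TD update:
    #   Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (r + bootstrap - Q[s, a])
    return Q
```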
Thank you to everyone who participated in our workshop!
Are Vision Foundation Models ready to tackle pixel-level tasks? 🖼️ Join us at the Pixel-level Vision Foundation Models (PixFoundation) Workshop at #CVPR2025! We’re excited to introduce an outstanding lineup of invited speakers. Meet them below 👇
As a video gaming company, @Krafton_AI has secretly been cooking something big with @NVIDIAAI for a while! 🥳 We introduce Orak, the first comprehensive video gaming benchmark for LLMs! arxiv.org/abs/2506.03610
Excited to present FastTD3: a simple, fast, and capable off-policy RL algorithm for humanoid control -- with an open-source code to run your own humanoid RL experiments in no time! Thread below 🧵
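FastTD3 builds on TD3; as a reference point, here is the core TD3 target computation (target policy smoothing plus clipped double Q-learning). The FastTD3-specific choices that make it fast live in the thread and the open-source code, not in this sketch, and the function names here are placeholders.

```python
import torch

def td3_targets(critic1_t, critic2_t, actor_t, next_obs, rewards, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # Standard TD3 target: add clipped noise to the target action
    # (target policy smoothing), then take a min over twin target critics.
    with torch.no_grad():
        next_actions = actor_t(next_obs)
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-1.0, 1.0)
        q_min = torch.min(critic1_t(next_obs, next_actions),
                          critic2_t(next_obs, next_actions))
        return rewards + gamma * (1.0 - dones) * q_min
```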
Our new system trains humanoid robots using data from cell phone videos, enabling skills such as climbing stairs and sitting on chairs in a single policy (w/ @redstone_hong @junyi42 @davidrmcall).
We took a short break from robotics to build a human-level agent to play Competitive Pokémon. Partially observed. Stochastic. Long-horizon. Now mastered with Offline RL + Transformers. Our agent, trained on 475k+ human battles, hits the top 10% on Pokémon Showdown leaderboards.…
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
In the past, we extended the convolution operator to go from low-level image processing to high-level visual reasoning. Can we also extend physical operators for more high-level physical reasoning? Introducing the Denoising Hamiltonian Network (DHN): arxiv.org/pdf/2503.07596
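To make "physical operator" concrete: below is the classical Hamiltonian update that Hamiltonian-network-style models parameterize with a learned energy function. It is background for the tweet, not the DHN architecture itself; see the linked paper for what "denoising" adds on top.

```python
import torch

def hamiltonian_step(H, q, p, dt=0.01):
    # One explicit Euler step of Hamilton's equations,
    #   dq/dt = dH/dp,   dp/dt = -dH/dq,
    # where H is a (possibly learned) scalar energy function of (q, p).
    q = q.detach().requires_grad_(True)
    p = p.detach().requires_grad_(True)
    energy = H(q, p).sum()
    dH_dq, dH_dp = torch.autograd.grad(energy, (q, p))
    return (q + dt * dH_dp).detach(), (p - dt * dH_dq).detach()
```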
Best wishes for your ICCV submissions, and congrats again on your @CVPR papers! Please share your latest work with the workshop. CVPR dual submissions are allowed, so just reupload them. This is a last-minute call: the deadline is tomorrow! 🔥
Call for Papers: #CVPR2025 PixFoundation Workshop! Please share your accepted papers at CVPR and submissions at ICCV! 🔥 📅 Deadline (updated): March 9, 2025 sites.google.com/view/pixfounda…
Excited to share our work on the Diffusion Forcing Transformer—a flexible model that can generate videos from any number of images! We introduce History Guidance to boost quality, consistency, and dynamics, along with capabilities like OOD generalization and long stable rollouts!
Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: boyuan.space/history-guidan… (1/7)
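Since History Guidance is described as "a simple add-on to any existing video diffusion model," here is a rough sketch of the general classifier-free-guidance-style recipe it resembles: combine a history-conditioned and a history-dropped denoiser prediction. All names are placeholders, and DFoT's actual History Guidance variants differ; see the project page for the real formulation.

```python
import torch

def history_guided_eps(denoiser, noisy_frames, t, history, null_history, w=2.0):
    # CFG-style combination of a history-conditioned and a history-dropped
    # prediction; `denoiser`, `history`, and `null_history` are assumed APIs.
    eps_cond = denoiser(noisy_frames, t, context=history)
    eps_uncond = denoiser(noisy_frames, t, context=null_history)
    return eps_uncond + w * (eps_cond - eps_uncond)
```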