Sicheng Mo
@sicheng_mo
Graduate student at UCLA. Interested in ML and CV.
#ICCV2025 Introducing X-Fusion: Introducing New Modality to Frozen Large Language Models. It is a novel framework that adapts pretrained LLMs (e.g., LLaMA) to new modalities (e.g., vision) while retaining their language capabilities and world knowledge! (1/n) Project Page:…
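(A minimal sketch of the general "frozen LLM + trainable new-modality branch" recipe the tweet describes, not the actual X-Fusion architecture. The class name, the vision tower, and the HuggingFace-style `inputs_embeds` call are all assumptions for illustration.)

```python
import torch
import torch.nn as nn

class FrozenLMWithVision(nn.Module):
    """Wrap a pretrained LLM, freeze it, and bolt on a trainable vision branch."""

    def __init__(self, llm: nn.Module, vision_tower: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():
            p.requires_grad = False                 # language weights stay frozen
        self.vision_tower = vision_tower            # trainable: pixels -> patch features
        self.proj = nn.Linear(vision_dim, llm_dim)  # trainable: map into the LLM token space

    def forward(self, image, text_embeds):
        vis_tokens = self.proj(self.vision_tower(image))
        # Prepend visual tokens to the text sequence; only the new branch receives gradients.
        return self.llm(inputs_embeds=torch.cat([vis_tokens, text_embeds], dim=1))
```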
What exactly is a "world model"? And what keeps existing video generation models from being true world models? In my new blog post, I argue that a true video world model must be causal, interactive, persistent, real-time, and physically accurate. xunhuang.me/blogs/world_mo…
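(A sketch of the interface those five properties imply; class and method names are illustrative, not taken from the blog post.)

```python
class InteractiveWorldModel:
    """Queried step by step, conditioned only on the past (causal), accepts
    actions (interactive), keeps state across calls (persistent), and must
    return each frame fast enough for real-time use."""

    def reset(self, context_frames):
        """Initialize internal state from observed frames; no access to the future."""
        ...

    def step(self, action):
        """Advance one step given an action and return the next frame.
        State persists, so previously seen content stays consistent when revisited."""
        ...
```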
@CVPR is around the corner!! Join us at the Workshop on T4V at #CVPR2025 with a great speaker lineup (@MikeShou1, @jw2yang4ai, @WenhuChen, @roeiherzig, Yuheng Li, Kristen Grauman) covering diverse topics! Website: sites.google.com/view/t4v-cvpr2… #CVPR #Transformer #Vision #T4V2025 #T4V
Join our #CVPR2025 Workshop on Real2Sim: Bridging the Gap between Neural Rendering and Robot Learning on 6/12! With amazing speakers: @drmapavone @shahdhruv_ @GordonWetzstein @LingjieLiu1 @sicheng_mo @RuohanZhang76 @carlo_sferrazza ⏲️ Thu, 6/12, 1:45-5:30 PM CDT 🏢 Davidson…
✨New preprint: Dual-Process Image Generation! We distill *feedback from a VLM* into *feed-forward image generation*, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator.🧵
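(A rough sketch of the inference-time loop as I read the summary: the VLM scores the generated image against the task prompt, and that feedback is backpropagated into a small set of generator weights, e.g. LoRA parameters. `generator`, `vlm_score`, and `lora_params` are placeholder names, not the paper's API.)

```python
import torch

def dual_process_step(generator, vlm_score, prompt, lora_params, lr=1e-3):
    opt = torch.optim.Adam(lora_params, lr=lr)
    image = generator(prompt)             # feed-forward generation
    loss = -vlm_score(image, prompt)      # VLM "visually inspects" the result
    loss.backward()                       # distill the feedback into the generator
    opt.step()
    opt.zero_grad()
    return image
```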
Oh great, it was accepted to #ICLR2025 as a Spotlight paper!
We release a new urban simulator, MetaUrban, to support research on AI agents for micromobility. The work will be presented at #ICLR2025, and the demo code can run on any laptop. Webpage: metadriverse.github.io/metaurban/ Code: github.com/metadriverse/m… Paper: arxiv.org/pdf/2407.08725
Mitigating racial bias in LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference @NeurIPSConf We have ethics reviews for authors, but not for invited speakers? 😡
Come by our poster tomorrow :D
Stop by our Ctrl-X poster this week at #NeurIPS2024 :D Wednesday Dec. 11th, 4:30–7:30 PM, East Exhibit Hall A–C #1605 Come say hi!
🚀 Introducing CAT4D! 🚀 CAT4D transforms any real or generated video into dynamic 3D scenes with a multi-view video diffusion model. The outputs are dynamic 3D models that we can freeze and look at from novel viewpoints, in real-time! Be sure to try our interactive viewer!
🚀 I am recruiting PhD students for Fall 2025 at the UCLA Robot Intelligence Lab! 🤖 If you are interested in robot learning and human-robot interaction, list me as a potential advisor when you apply to the UCLA CS PhD program! #PhD #Robotics @CS_UCLA
Ctrl-X was accepted to #NeurIPS2024! We present a guidance-free structure and appearance control method for any pre-trained diffusion model. Paper, code, and results: genforce.github.io/ctrl-x It was awesome collaborating with @sicheng_mo @BenKlingher Fangzhou Mu @zhoubolei :D
Very excited to get this out: “DVT: Denoising Vision Transformers”. We've identified and combated those annoying positional patterns in many ViTs. Our approach denoises them, achieving SOTA results and stunning visualizations! Learn more on our website: jiawei-yang.github.io/DenoisingViT/
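(A back-of-the-envelope way to see the artifact DVT targets: averaging ViT patch features over many images washes out image content but keeps anything tied to patch position. This is only a diagnostic sketch with placeholder names, not the paper's denoising method.)

```python
import torch

def positional_artifact_estimate(vit, images):
    """images: (N, 3, H, W); vit(x) -> (B, num_patches, dim) patch features."""
    with torch.no_grad():
        feats = torch.stack([vit(img.unsqueeze(0)).squeeze(0) for img in images])
    return feats.mean(dim=0)  # (num_patches, dim): the position-dependent component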
In case you were wondering what’s going on with the back of the #CVPR2024 T-shirt: it’s a hybrid image made by @invernopark and @dangengdg! When you look at it up close, you’ll just see the Seattle skyline, but when you view it from a distance, the text “CVPR” should appear.
So many new LLM architectures (Mambas🐍, Transformers🤖,🦙,🦔, Hyenas🐺,🦓…), so little GPU time to combine them into hybrid LLMs… Good news! Today we release Manticore, a system for creating **pretrained hybrids** from pretrained models! 👨🌾🦁🦂 arxiv.org/pdf/2406.00894 1/n
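(A toy guess at what a "pretrained hybrid" could look like: alternate blocks taken from two pretrained backbones, say a Transformer and a Mamba, with small linear projectors gluing their feature spaces together. This is my reading of the one-line summary, not Manticore's actual implementation.)

```python
import torch.nn as nn

class HybridBackbone(nn.Module):
    """Interleave pretrained blocks from two architectures behind projectors."""

    def __init__(self, blocks_a, blocks_b, dim_a, dim_b):
        super().__init__()
        layers = []
        for a, b in zip(blocks_a, blocks_b):
            layers += [a, nn.Linear(dim_a, dim_b), b, nn.Linear(dim_b, dim_a)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)
```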