Xichen Pan
@xichen_pan
CS Ph.D. Student @NYU_Courant, Visiting Researcher @metaai | Prev: @MSFTResearch, @AlibabaGroup, http://horizon.cc, @sjtu1896
We find that training unified multimodal understanding and generation models is surprisingly easy: you do not need to tune the MLLM at all. The MLLM's knowledge, reasoning, and in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
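A minimal sketch of the recipe described above, assuming the usual setup of learnable query tokens passed through a frozen MLLM and a small trainable connector into a diffusion decoder; all module names and interfaces here are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class FrozenMLLMToGeneration(nn.Module):
    """Illustrative sketch: keep the MLLM frozen and train only learnable
    query tokens plus a connector feeding a diffusion decoder."""

    def __init__(self, mllm, diffusion_decoder, num_queries=64, mllm_dim=4096, cond_dim=1024):
        super().__init__()
        self.mllm = mllm                      # pretrained multimodal LLM, kept frozen
        for p in self.mllm.parameters():
            p.requires_grad = False

        # learnable query tokens appended to the prompt (trainable)
        self.queries = nn.Parameter(torch.randn(1, num_queries, mllm_dim) * 0.02)
        # small connector projecting MLLM hidden states into the decoder's conditioning space (trainable)
        self.connector = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )
        self.diffusion_decoder = diffusion_decoder  # e.g. a latent diffusion model (assumed interface)

    def forward(self, prompt_embeds, images, timesteps):
        b = prompt_embeds.size(0)
        # run the frozen MLLM over [prompt tokens ; learnable queries]
        inputs = torch.cat([prompt_embeds, self.queries.expand(b, -1, -1)], dim=1)
        hidden = self.mllm(inputs_embeds=inputs).last_hidden_state
        # hidden states at the query positions become the generation condition
        cond = self.connector(hidden[:, -self.queries.size(1):])
        # standard diffusion training loss, conditioned on the projected queries
        return self.diffusion_decoder.loss(images, cond, timesteps)
```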

Thanks for bringing this to my attention. I honestly wasn’t aware of the situation until the recent posts started going viral. I would never encourage my students to do anything like this—if I were serving as an Area Chair, any paper with this kind of prompt would be…
MetaQuery is now open source, with both the code and data available.
The code and instruction-tuning data for MetaQuery are now open-sourced! Code: github.com/facebookresear… Data: huggingface.co/collections/xc… Two months ago, we released MetaQuery, a minimal training recipe for SOTA unified understanding and generation models. We showed that tuning few…
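The Hugging Face link above is truncated, so the dataset id below is only a placeholder; a minimal loading sketch with the datasets library would look roughly like this:

```python
from datasets import load_dataset

# Placeholder repo id: the collection link in the tweet is truncated, so substitute
# the actual dataset name from the released Hugging Face collection.
ds = load_dataset("facebookresearch/metaquery-instruction-tuning", split="train")  # hypothetical id

# Field names depend on the release; inspect one instruction-tuning example first.
print(ds[0].keys())
```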
Heading to #NeurIPS2024 to present Cambrian-1 w/ @TongPetersb! Catch our oral presentation Friday @ 10am (Oral 5C) and our poster afterwards until 2pm (#3700 in East Hall A-C) 🪼🎉
🚨 New VLM Paper! Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion 1️⃣ Are CLIP-style vision transformers the best vision encoder for VLMs? We explore new possibilities with Florence-2, a generative vision foundation model,…
SV3D takes an image as input and outputs camera-controlled novel views that are highly consistent across views. We also propose techniques to convert these novel views into high-quality 3D meshes. The view-synthesis models are publicly released. Project page: sv3d.github.io
Today, we are releasing Stable Video 3D, a generative model based on Stable Video Diffusion. This new model advances the field of 3D technology, delivering greatly improved quality and multi-view consistency. The model is available now for commercial and non-commercial use with a Stability…
TLDR: Meet ✨Lumiere✨, our new text-to-video model from @GoogleAI! Lumiere is designed to create entire clips in just one go, seamlessly opening up possibilities for many applications: image-to-video 🖼️, stylized generation 🖌️, video editing 🪩, and beyond. See 🧵👇
Introducing the Scalable Interpolant Transformer! SiT integrates a flexible interpolant framework into DiT, enabling a nuanced exploration of dynamical transport in image generation. With an FID of 2.06 on ImageNet 256, SiT pushes interpolant-based models to new heights! (1/n)
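For context, a velocity-matching sketch of the interpolant objective the tweet refers to, assuming the simple linear interpolant x_t = (1 - t)·x + t·ε; SiT itself studies several interpolant, prediction, and sampler choices, so treat this as a generic illustration rather than the paper's exact objective:

```python
import torch

def interpolant_velocity_loss(model, x_data, t=None):
    """Velocity matching under an assumed linear interpolant
    alpha_t = 1 - t, sigma_t = t (one of the choices SiT explores)."""
    b = x_data.size(0)
    if t is None:
        t = torch.rand(b, device=x_data.device)            # t ~ U(0, 1)
    noise = torch.randn_like(x_data)
    t_ = t.view(b, *([1] * (x_data.dim() - 1)))             # broadcast t over image dims

    # interpolant: x_t = alpha_t * x + sigma_t * eps, here alpha_t = 1 - t, sigma_t = t
    x_t = (1.0 - t_) * x_data + t_ * noise
    # velocity target: d/dt x_t = alpha_t' * x + sigma_t' * eps = -x + eps
    target = noise - x_data

    pred = model(x_t, t)                                    # transformer predicts the velocity field
    return torch.mean((pred - target) ** 2)
```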
Delighted to announce that our Kosmos-G has been accepted to ICLR 2024. Thanks to my mentor @donglixp at @MSFTResearch. We are working on integrating Kosmos-G into diffusers. Looking forward to meeting you in Vienna!
Check out our work on zero-shot subject-driven generation. Now you can prompt Stable Diffusion with not only text but also images! Sampling speed stays close to the original SD. Project page: xichenpan.com/kosmosg/ Code: aka.ms/Kosmos-G
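A hedged sketch of how this kind of image-plus-text prompting can drive Stable Diffusion through the `prompt_embeds` argument in diffusers; `mllm` and `aligner` below are placeholder components standing in for the multimodal encoder and the projection into SD's conditioning space, not the released Kosmos-G API:

```python
import torch
from diffusers import StableDiffusionPipeline

def subject_driven_generate(mllm, aligner, text, subject_images):
    """Sketch: condition Stable Diffusion on an interleaved text+image prompt.
    `mllm` and `aligner` are assumed interfaces: the MLLM consumes interleaved
    text and images, and the aligner maps its hidden states to the
    (batch, 77, 768) tensor SD expects as text conditioning."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    hidden = mllm(text=text, images=subject_images)   # assumed MLLM call
    cond = aligner(hidden)                            # -> (1, 77, 768) conditioning

    # Only the conditioning changes, so sampling cost stays close to vanilla SD.
    return pipe(prompt_embeds=cond).images[0]
```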
One thing I really like about #ICLR is that its review data are open to everyone! 🥰 Same as last year, I collected the #ICLR2024 review data, adapting my previous script to the OpenReview API v2. Here are the histograms based on 7,331 submissions. Hope this helps! 🙋
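A sketch of how such a collection script might look with the openreview-py API v2 client; the venue invitation id follows the usual pattern for ICLR 2024, but the exact reply/rating field layout is an assumption, so adjust to what the API actually returns:

```python
import openreview
import matplotlib.pyplot as plt

# API v2 client (no login needed for public ICLR reviews).
client = openreview.api.OpenReviewClient(baseurl="https://api2.openreview.net")

# Fetch all submissions with their replies (reviews, comments, decisions).
submissions = client.get_all_notes(
    invitation="ICLR.cc/2024/Conference/-/Submission", details="replies"
)

ratings = []
for note in submissions:
    for reply in note.details["replies"]:
        if any(inv.endswith("Official_Review") for inv in reply["invitations"]):
            # Assumed field layout: API v2 wraps values as {"value": ...} and the
            # rating string starts with the numeric score, e.g. "6: marginally above ...".
            ratings.append(int(reply["content"]["rating"]["value"].split(":")[0]))

plt.hist(ratings, bins=range(1, 12))
plt.xlabel("reviewer rating")
plt.ylabel("count")
plt.title("ICLR 2024 review ratings")
plt.savefig("iclr2024_ratings.png")
```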