Jiuhai Chen
@JiuhaiC
CS PhD student @ UMD. Ex-intern @Meta @Microsoft @Amazon. On the industry job market
🚀 Introducing BLIP3-o: A Family of Fully Open Unified Multimodal Models arxiv.org/pdf/2505.09568 🔓 Attempting to unlock GPT-4o’s image generation. We open-source everything, including 25 million pre-training samples!


Super excited to attend #CVPR2025 in person! Catch our spotlight talk on BLIP3-o at the Computer Vision in the Wild workshop 👉 computer-vision-in-the-wild.github.io/cvpr-2025/ Also check out Florence-VL at poster #372, Sunday 10:30–12:30
🌊Tried BLIP3-o? Our family of unified multimodal models is making waves, now open-sourced for the AI Research community. 🔓 Github Repo: bit.ly/4muUBzm 🤗 Models: bit.ly/4kB9oXK 🪧 Demo: bit.ly/4jb0YVD 📰 News: bit.ly/3Z1tuC8 ✍️ Blog:…
Check out BLIP3-o among the notable AI models of the week!
10 notable AI models of the week: ▪️ Aya Vision ▪️ INTELLECT-2 ▪️ MiniMax-Speech ▪️ SWE-1 ▪️ Seed1.5-VL ▪️ BLIP3-o ▪️ Skywork-VL ▪️ Behind Maya ▪️ MiMo ▪️ AM-Thinking-v1 🧵
Introducing 🔥BLIP3-o🔥 -- A Family of Fully Open Unified Multimodal Models for Both Image Understanding and Image Generation 📊Paper: arxiv.org/pdf/2505.09568 🤗Models and Datasets: huggingface.co/BLIP3o 🧠Code: github.com/JiuhaiChen/BLI… 💻Demo: blip3o.salesforceresearch.ai We…
Our gradio demo for BLIP3-o: huggingface.co/spaces/BLIP3o/… using the open-source checkpoint: huggingface.co/BLIP3o/BLIP3o-…
Our first attempt at unlocking GPT-4o’s image generation — more to come in the next few weeks!
We find training unified multimodal understanding and generation models is so easy that you do not need to tune the MLLM at all. The MLLM's knowledge, reasoning, and in-context learning transfer from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
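For intuition, here is a minimal sketch of the frozen-MLLM recipe above: a small trainable head is optimized with a rectified-flow (flow-matching) objective to produce image features conditioned on the frozen MLLM's hidden states, so gradients never touch the backbone. The head architecture, dimensions, Hugging Face-style model interface, and loss details are illustrative assumptions, not the released BLIP3-o code.

```python
# Hedged sketch only: module names, dimensions, and the conditioning interface
# are assumptions for illustration, not the BLIP3-o implementation.
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small trainable head that denoises image features conditioned on frozen MLLM states."""
    def __init__(self, feat_dim=1152, cond_dim=4096, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy features, timestep, and the frozen-MLLM condition.
        t = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def training_step(frozen_mllm, head, optimizer, input_ids, image_feats):
    # The MLLM stays frozen: only the head receives gradients.
    with torch.no_grad():
        # Hugging Face-style call assumed for illustration.
        cond = frozen_mllm(input_ids, output_hidden_states=True).hidden_states[-1]
    cond = cond[:, -image_feats.size(1):, :]          # states at the image-token positions (assumed layout)
    t = torch.rand(image_feats.size(0), device=image_feats.device)
    noise = torch.randn_like(image_feats)
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * image_feats
    v_target = image_feats - noise                     # rectified-flow velocity target
    v_pred = head(x_t, t, cond)
    loss = nn.functional.mse_loss(v_pred, v_target)    # trains the head only; backbone is frozen
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```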
Florence-VL is accepted by #CVPR2025. Thanks to all coauthors! BTW, a very powerful multimodal model for image understanding & generation is coming soon, stay tuned! 🚀🔥
🚨 New VLM Paper! Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion 1️⃣ Are CLIP-style vision transformers the best vision encoder for VLMs? We explore new possibilities with Florence-2, a generative vision foundation model,…
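For a rough sense of the depth-breadth fusion idea, here is a minimal sketch: features extracted from Florence-2 under several prompts ("breadth") and at several depths are concatenated along the channel dimension and projected by an MLP into the LLM's token space. The class name, dimensions, and the concatenate-then-project fusion below are illustrative assumptions, not Florence-VL's actual implementation.

```python
# Hedged sketch of depth-breadth fusion; names and sizes are assumptions.
import torch
import torch.nn as nn

class DepthBreadthFusion(nn.Module):
    def __init__(self, num_views=4, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # One MLP projector over the channel-concatenated feature views.
        self.proj = nn.Sequential(
            nn.Linear(num_views * vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feature_views):
        # feature_views: list of [batch, num_patches, vis_dim] tensors, one per
        # prompt/layer combination extracted from the vision encoder.
        fused = torch.cat(feature_views, dim=-1)   # concatenate along channels
        return self.proj(fused)                    # [batch, num_patches, llm_dim]

# Usage with dummy features standing in for Florence-2 outputs:
views = [torch.randn(2, 576, 1024) for _ in range(4)]
visual_tokens = DepthBreadthFusion()(views)        # visual tokens fed to the LLM
```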
This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,…
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph, a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Only a modest amount of generation data is needed to elicit…
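A hedged sketch of the MetaMorph-style idea above: the LLM keeps its usual next-token text loss while also regressing continuous visual embeddings at image positions, so generation piggybacks on the understanding backbone. The heads, the cosine-similarity loss, and the masking scheme are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: joint text + visual-embedding prediction. All names and the
# loss choice are assumptions for illustration.
import torch
import torch.nn as nn

def joint_loss(text_head, vision_head, hidden_states, text_labels, visual_targets, visual_mask):
    # hidden_states: [batch, seq, dim] from the language-model backbone.
    # Text positions: standard next-token cross-entropy.
    text_logits = text_head(hidden_states)
    text_loss = nn.functional.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100
    )
    # Image positions: regress continuous visual embeddings (e.g. vision-encoder features).
    pred_vis = vision_head(hidden_states[visual_mask])            # [num_visual_tokens, vis_dim]
    cos = nn.functional.cosine_similarity(pred_vis, visual_targets, dim=-1)
    visual_loss = (1.0 - cos).mean()
    return text_loss + visual_loss
```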
Try our Florence-VL demo!
📣 Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2. Achieves significant improvements in general VQA, perception, hallucination, OCR, chart, knowledge-intensive understanding, and more 🔥 Learn more 👇