David Fan
@DavidJFan
Facebook AI Research (FAIR) | Video Representations, Self-Supervised Learning | @Princeton Computer Science '19
Can visual SSL match CLIP on VQA? Yes! With controlled experiments, we show that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params), which is trained purely on web images – without any language supervision.

Come by the poster if you want recommendations on cool restaurants to try in Vancouver 😃!
[#ICML2025] Have you ever wanted to train LLMs on distributed private data but were blocked by model size or privacy constraints 😔? Here’s a solution: Introducing 🌸POPri (Policy Optimization for Private Data)! Poster 🗓️ today at 4:30pm PT, 📍East Exhibition Hall A-B E-1006
Congrats @jianyuan_wang and co!!!
Many Congratulations to @jianyuan_wang, @MinghaoChen23, @n_karaev, Andrea Vedaldi, Christian Rupprecht and @davnov134 for winning the Best Paper Award @CVPR for "VGGT: Visual Geometry Grounded Transformer" 🥇🎉 🙌🙌 #CVPR2025!!!!!!
It was a pleasure to work with the team on this! Looking forward to further improving the ability to learn from and predict in the visual world.
Welcome Rob! So blessed to have you steer the ship! See you around the office :)
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert an LLM into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via autoregressive (AR) prediction. This confers the ability to “think visually”, and…
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and…
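For a concrete picture of that "tokens -> transformers -> diffusion -> pixel" pipeline, here is a minimal, hypothetical PyTorch sketch (not the released MetaMorph code): a transformer backbone consumes interleaved text and visual tokens and predicts both next-token logits and continuous visual embeddings, which a diffusion decoder (omitted here) would render to pixels. All module names, sizes, and the bidirectional attention are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a unified multimodal model: text + visual tokens in,
# text logits + predicted visual embeddings out. Illustrative only.
import torch
import torch.nn as nn

class UnifiedMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, visual_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.visual_proj_in = nn.Linear(visual_dim, dim)      # map image tokens into the LLM space
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # stand-in for the LLM
        self.text_head = nn.Linear(dim, vocab_size)            # next-token logits
        self.visual_head = nn.Linear(dim, visual_dim)           # predicted visual embeddings

    def forward(self, text_ids, image_tokens):
        # Interleave text and visual tokens into one sequence.
        seq = torch.cat([self.text_embed(text_ids),
                         self.visual_proj_in(image_tokens)], dim=1)
        # Bidirectional attention here for simplicity; a real AR setup would use a causal mask.
        h = self.backbone(seq)
        return self.text_head(h), self.visual_head(h)

model = UnifiedMultimodalLM()
text_ids = torch.randint(0, 32000, (1, 16))
image_tokens = torch.randn(1, 64, 768)   # e.g. embeddings from a frozen vision encoder
text_logits, visual_preds = model(text_ids, image_tokens)
# visual_preds would then condition a diffusion decoder (not shown) to produce pixels.
```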
Excited to share that our paper on Navigation World Models was selected for an Oral presentation at CVPR! Code & models: github.com/facebookresear… huggingface.co/facebook/nwm
Happy to share our new work on Navigation World Models! 🔥🔥 Navigation is a fundamental skill of agents with visual-motor capabilities. We train a single World Model across multiple environments and diverse agent data. w/ @GaoyueZhou, Danny Tran, @trevordarrell and @ylecun.
New paper - Transformers, but without normalization layers (1/n)