Peter Tong
@TongPetersb
Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie
This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,…
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other, and only a modest amount of generation data is needed to elicit…
Can diffusion models appear to be learning, when they’re actually just memorizing the training data? We show and investigate this phenomenon in the context of neural network weight generation, in our recent paper “Generative Modeling of Weights: Generalization or Memorization?"
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
CDS Prof. @ylecun and CDS-affiliated Asst. Prof. @sainingxie show that visual self-supervised models, when scaled to 7B parameters and trained on 2B images, can match or outperform CLIP on VQA tasks—without any language supervision. Read the paper: arxiv.org/abs/2504.01017
🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
🧵 1/7 Should AI agents "think more" or "do more"? 🤔 The current trend is to scale test-time compute, making agents generate longer reasoning traces. But what if that’s the wrong approach for interactive tasks? In our new work, we argue for a new scaling dimension: Test-Time…
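A minimal sketch of the idea described above, contrasting the two test-time scaling axes: longer reasoning traces per step versus more interaction steps per episode. The environment, agent policy, and budgets below are toy placeholders of my own, not the TTI paper's implementation.

```python
# Toy sketch: scale interaction steps (TTI-style) vs. thinking tokens per step.
from dataclasses import dataclass
import random


@dataclass
class ToyEnv:
    """Hypothetical navigation-style task: the agent must reach a hidden goal state."""
    goal: int = 7
    state: int = 0

    def step(self, action: int):
        self.state = action
        done = self.state == self.goal
        return self.state, (1.0 if done else 0.0), done


def toy_policy(state: int, thinking_tokens: int) -> int:
    # Stand-in for an LLM call; thinking_tokens is just a placeholder knob here.
    random.seed(state * 1000 + thinking_tokens)
    return random.randint(0, 9)


def rollout(max_interaction_steps: int, thinking_tokens_per_step: int) -> float:
    """Run one episode under a fixed interaction-step budget."""
    env, state = ToyEnv(), 0
    for _ in range(max_interaction_steps):
        action = toy_policy(state, thinking_tokens_per_step)
        state, reward, done = env.step(action)
        if done:
            return reward  # success
    return 0.0  # ran out of interaction budget


# "Think more": few env steps, long reasoning per step.
# "Do more" (the TTI scaling axis): many env steps, modest reasoning per step.
print("think-more success:", rollout(max_interaction_steps=3, thinking_tokens_per_step=4096))
print("do-more success:   ", rollout(max_interaction_steps=30, thinking_tokens_per_step=256))
```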
🚀 Glad to see our All-Angles Bench (github.com/Chenyu-Wang567…) being adopted to evaluate 3D spatial understanding in Seed-1.5-VL-thinking, alongside OpenAI o1 and Gemini 2.5 Pro!
Introducing Seed-1.5-VL-thinking, which achieves SOTA on 38 out of 60 VLM benchmarks 🥳🥳🥳 github.com/ByteDance-Seed…
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: danielchyeh.github.io/All-Angles-Ben…
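For a rough sense of how a multi-view QA benchmark like this could be consumed, here is a hypothetical evaluation loop; the field names and the answer_question() call are placeholders of my own, not the benchmark's actual schema or API (see the project page for that).

```python
from typing import Dict, List


def answer_question(images: List[str], question: str, choices: List[str]) -> str:
    """Placeholder for an MLLM call (e.g., an API client); returns one of the choices."""
    return choices[0]


def evaluate(qa_pairs: List[Dict]) -> float:
    """Score multiple-choice accuracy over multi-view QA pairs."""
    correct = 0
    for qa in qa_pairs:
        pred = answer_question(qa["view_images"], qa["question"], qa["choices"])
        correct += int(pred == qa["answer"])
    return correct / max(len(qa_pairs), 1)


# Toy example: one question asked over two camera views of the same scene.
sample = [{
    "view_images": ["scene01_cam_front.jpg", "scene01_cam_side.jpg"],
    "question": "How many people are visible across both views?",
    "choices": ["2", "3", "4"],
    "answer": "3",
}]
print(f"accuracy: {evaluate(sample):.2f}")
```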
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
Web-SSL model weights are now available on GitHub and HuggingFace! You can load the models through the Hugging Face Transformers API or with native PyTorch, whichever you prefer. For more usage details, please see github.com/facebookresear… HuggingFace collection:…
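A minimal sketch of the Transformers loading path mentioned above; the model id below is a placeholder, so check the facebookresearch GitHub repo and the HuggingFace collection for the exact checkpoint names.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Placeholder id; replace with an actual Web-SSL checkpoint from the HF collection.
model_id = "facebook/webssl-dino-placeholder"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features; pool them (or take the CLS token, if present) for a global embedding.
features = outputs.last_hidden_state
print(features.shape)
```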
Recently open-sourced projects from @TongPetersb, @DavidJFan, and the team at Meta FAIR. MetaMorph (training code and model weights): github.com/facebookresear… Web-SSL (model weights for Web-DINO and Web-MAE) github.com/facebookresear… FAIR's still leading the way in open research.
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/fa… Github: github.com/facebookresear… Huge credit to @DavidJFan for putting these models together!
Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and…
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and…
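A conceptual sketch of that "tokens -> transformers -> diffusion -> pixel" flow, based only on my reading of the tweet rather than the released training code: the LLM autoregressively predicts continuous visual-token embeddings, and a separate diffusion decoder maps them back to pixels. All module names and sizes below are placeholders.

```python
import torch
import torch.nn as nn


class UnifiedLM(nn.Module):
    """Placeholder LLM backbone that emits continuous visual-token embeddings."""
    def __init__(self, dim=1024, num_visual_tokens=64):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.visual_head = nn.Linear(dim, dim)  # regresses the next visual embedding

    def generate_visual_tokens(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Autoregressive prediction of continuous visual embeddings.
        tokens, visual = prompt_embeds, []
        for _ in range(self.num_visual_tokens):
            hidden = self.backbone(tokens)
            next_tok = self.visual_head(hidden[:, -1:])
            visual.append(next_tok)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return torch.cat(visual, dim=1)  # (B, num_visual_tokens, dim)


class DiffusionDecoder(nn.Module):
    """Placeholder for a diffusion model conditioned on the predicted visual tokens."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * 32 * 32)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # A real decoder would run iterative denoising; this just projects pooled
        # tokens to a tiny image to illustrate the data flow.
        pooled = visual_tokens.mean(dim=1)
        return self.proj(pooled).view(-1, 3, 32, 32)


lm, decoder = UnifiedLM(), DiffusionDecoder()
prompt = torch.randn(1, 16, 1024)                   # stand-in for embedded text tokens
visual_tokens = lm.generate_visual_tokens(prompt)   # tokens -> transformer
image = decoder(visual_tokens)                      # -> diffusion -> pixels
print(visual_tokens.shape, image.shape)
```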
New paper from FAIR+NYU: Q: Is language supervision required to learn effective visual representations for multimodal tasks? A: No. ⬇️⬇️⬇️
In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
[7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…
FAIR is probably the only lab outside of academia where research projects can start like this.