Peter Tong
@TongPetersb
Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie
This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,…
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other, and only a modest amount of generation data is needed to elicit…
Can diffusion models appear to be learning, when they’re actually just memorizing the training data? We show and investigate this phenomenon in the context of neural network weight generation, in our recent paper “Generative Modeling of Weights: Generalization or Memorization?"
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
CDS Prof. @ylecun and CDS-affiliated Asst. Prof. @sainingxie show that visual self-supervised models, when scaled to 7B parameters and trained on 2B images, can match or outperform CLIP on VQA tasks—without any language supervision. Read the paper: arxiv.org/abs/2504.01017
🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
🧵 1/7 Should AI agents "think more" or "do more"? 🤔 The current trend is to scale test-time compute, making agents generate longer reasoning traces. But what if that’s the wrong approach for interactive tasks? In our new work, we argue for a new scaling dimension: Test-Time…
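A minimal sketch of the idea described above, contrasting the two test-time scaling axes: longer reasoning traces per step versus more interaction steps per episode. The environment, agent policy, and budgets below are toy placeholders of my own, not the TTI paper's implementation.

```python
# Toy sketch: scale interaction steps (TTI-style) vs. thinking tokens per step.
from dataclasses import dataclass
import random


@dataclass
class ToyEnv:
    """Hypothetical navigation-style task: the agent must reach a hidden goal state."""
    goal: int = 7
    state: int = 0

    def step(self, action: int):
        self.state = action
        done = self.state == self.goal
        return self.state, (1.0 if done else 0.0), done


def toy_policy(state: int, thinking_tokens: int) -> int:
    # Stand-in for an LLM call; thinking_tokens is just a placeholder knob here.
    random.seed(state * 1000 + thinking_tokens)
    return random.randint(0, 9)


def rollout(max_interaction_steps: int, thinking_tokens_per_step: int) -> float:
    """Run one episode under a fixed interaction-step budget."""
    env, state = ToyEnv(), 0
    for _ in range(max_interaction_steps):
        action = toy_policy(state, thinking_tokens_per_step)
        state, reward, done = env.step(action)
        if done:
            return reward  # success
    return 0.0  # ran out of interaction budget


# "Think more": few env steps, long reasoning per step.
# "Do more" (the TTI scaling axis): many env steps, modest reasoning per step.
print("think-more success:", rollout(max_interaction_steps=3, thinking_tokens_per_step=4096))
print("do-more success:   ", rollout(max_interaction_steps=30, thinking_tokens_per_step=256))
```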
🚀 Glad to see our All-Angles Bench (github.com/Chenyu-Wang567…) being adopted to evaluate 3D spatial understanding in Seed-1.5-VL-thinking, alongside OpenAI o1 and Gemini 2.5 Pro!
Introducing Seed-1.5-VL-thinking, which achieves SOTA on 38 out of 60 VLM benchmarks 🥳🥳🥳 github.com/ByteDance-Seed…
❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans? 🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes. 📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. 🌐 Project: danielchyeh.github.io/All-Angles-Ben…
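For a rough sense of how a multi-view QA benchmark like this could be consumed, here is a hypothetical evaluation loop; the field names and the answer_question() call are placeholders of my own, not the benchmark's actual schema or API (see the project page for that).

```python
from typing import Dict, List


def answer_question(images: List[str], question: str, choices: List[str]) -> str:
    """Placeholder for an MLLM call (e.g., an API client); returns one of the choices."""
    return choices[0]


def evaluate(qa_pairs: List[Dict]) -> float:
    """Score multiple-choice accuracy over multi-view QA pairs."""
    correct = 0
    for qa in qa_pairs:
        pred = answer_question(qa["view_images"], qa["question"], qa["choices"])
        correct += int(pred == qa["answer"])
    return correct / max(len(qa_pairs), 1)


# Toy example: one question asked over two camera views of the same scene.
sample = [{
    "view_images": ["scene01_cam_front.jpg", "scene01_cam_side.jpg"],
    "question": "How many people are visible across both views?",
    "choices": ["2", "3", "4"],
    "answer": "3",
}]
print(f"accuracy: {evaluate(sample):.2f}")
```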
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
Web-SSL model weights are now available on GitHub and HuggingFace! You can load the models through the Hugging Face Transformers API or with native PyTorch, whichever you prefer. For more usage details, please see github.com/facebookresear… HuggingFace collection:…
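A minimal sketch of the Transformers loading path mentioned above; the model id below is a placeholder, so check the facebookresearch GitHub repo and the HuggingFace collection for the exact checkpoint names.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Placeholder id; replace with an actual Web-SSL checkpoint from the HF collection.
model_id = "facebook/webssl-dino-placeholder"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features; pool them (or take the CLS token, if present) for a global embedding.
features = outputs.last_hidden_state
print(features.shape)
```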
Recently open-sourced projects from @TongPetersb, @DavidJFan, and the team at Meta FAIR. MetaMorph (training code and model weights): github.com/facebookresear… Web-SSL (model weights for Web-DINO and Web-MAE) github.com/facebookresear… FAIR's still leading the way in open research.
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/fa… Github: github.com/facebookresear… Huge credit to @DavidJFan for putting these models together!
Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and…
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and…
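A conceptual sketch of that "tokens -> transformers -> diffusion -> pixel" flow, based only on my reading of the tweet rather than the released training code: the LLM autoregressively predicts continuous visual-token embeddings, and a separate diffusion decoder maps them back to pixels. All module names and sizes below are placeholders.

```python
import torch
import torch.nn as nn


class UnifiedLM(nn.Module):
    """Placeholder LLM backbone that emits continuous visual-token embeddings."""
    def __init__(self, dim=1024, num_visual_tokens=64):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.visual_head = nn.Linear(dim, dim)  # regresses the next visual embedding

    def generate_visual_tokens(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Autoregressive prediction of continuous visual embeddings.
        tokens, visual = prompt_embeds, []
        for _ in range(self.num_visual_tokens):
            hidden = self.backbone(tokens)
            next_tok = self.visual_head(hidden[:, -1:])
            visual.append(next_tok)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return torch.cat(visual, dim=1)  # (B, num_visual_tokens, dim)


class DiffusionDecoder(nn.Module):
    """Placeholder for a diffusion model conditioned on the predicted visual tokens."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * 32 * 32)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # A real decoder would run iterative denoising; this just projects pooled
        # tokens to a tiny image to illustrate the data flow.
        pooled = visual_tokens.mean(dim=1)
        return self.proj(pooled).view(-1, 3, 32, 32)


lm, decoder = UnifiedLM(), DiffusionDecoder()
prompt = torch.randn(1, 16, 1024)                   # stand-in for embedded text tokens
visual_tokens = lm.generate_visual_tokens(prompt)   # tokens -> transformer
image = decoder(visual_tokens)                      # -> diffusion -> pixels
print(visual_tokens.shape, image.shape)
```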
New paper from FAIR+NYU: Q: Is language supervision required to learn effective visual representations for multimodal tasks? A: No. ⬇️⬇️⬇️
In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
[7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…
FAIR is probably the only lab outside of academia where research projects can start like this.