Victoria X Lin
@VictoriaLinML
Research Scientist @AIatMeta | MoMa🖼 • RA-DIT🔍 • OPT-IML | Ex: @SFResearch | PhD @uwcse 📜 http://threads.net/@v.linspiration 🌴 Bay Area
1/n Introducing MoMa 🖼, our new sparse early-fusion architecture for mixed-modal language modeling that significantly boosts pre-training efficiency 🚀 (arxiv.org/pdf/2407.21770). MoMa employs a mixture-of-experts (MoE) framework with modality-specific expert groups. Given any…
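The tweet is truncated, but the headline idea (a sparse MoE layer whose experts are partitioned into modality-specific groups, so image tokens and text tokens each route within their own pool) is easy to sketch. A minimal PyTorch version assuming simple top-1 token-choice routing; MoMa itself uses expert-choice routing and other refinements, and all names here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    """Toy MoMa-style layer: each modality owns its own expert group."""
    def __init__(self, d_model: int, experts_per_modality: int, n_modalities: int = 2):
        super().__init__()
        # one router and one expert pool per modality
        self.routers = nn.ModuleList(
            nn.Linear(d_model, experts_per_modality) for _ in range(n_modalities)
        )
        self.experts = nn.ModuleList(
            nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(experts_per_modality)
            )
            for _ in range(n_modalities)
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (seq, d_model); modality: (seq,) with values in {0, ..., n_modalities-1}
        out = torch.zeros_like(x)
        for m, (router, experts) in enumerate(zip(self.routers, self.experts)):
            idx = (modality == m).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            tokens = x[idx]
            gate = F.softmax(router(tokens), dim=-1)  # weights over this group only
            top = gate.argmax(dim=-1)                 # top-1 expert per token
            for e, expert in enumerate(experts):
                sel = (top == e).nonzero(as_tuple=True)[0]
                if sel.numel():
                    out[idx[sel]] = gate[sel, e:e + 1] * expert(tokens[sel])
        return out
```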

We don't often see a prep thread for a paper announcement on X, but this mini crash course on the information capacity of LLMs is well worth checking out
in prep for our new research dropping on arXiv tomorrow (I think), here is a thread about... CAPACITY MEASUREMENTS FOR LANGUAGE MODELS 🧵
the scale of data collection in AI labs pales in comparison to 2010s Google. it's mostly web scraping and data labeling. compare that to diligently photographing the streets of every country, mapping Earth via satellite, scanning every book known to man... now *that* was ambitious
and yet david burdeny still did it better in 2007 with a camera
I think this was the first AI image to really strike me. The first one to make me think that people were going to use this stuff to make very interesting works. Feels like a million years ago now.
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-autoregressive (non-AR) model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
So this is not a benchmark for software engineering agents. It’s meant to test core reasoning and intelligence through coding—backed by 71 pages of deep analysis from some of the best competitive programmers out there. This effort was carried out by students across multiple…
We introduce LiveCodeBench Pro, a live, exceptionally challenging benchmark comprising competitive programming problems sourced from IOI, Codeforces, and ICPC. Frontier models such as o3 and Gemini 2.5 achieve scores of 0% on the Hard split. Leaderboard: livecodebenchpro.com
splitting transformer parameters by ⭐Understanding (X→text) vs. 📷Generation (X→image) functionality. We already did that in LMFusion.
Let's talk about Mixture-of-Transformers (MoT) and heterogeneous omni-model training. 1. Inspired by prior architectures consisting of modality-specific parameters—such as Flamingo, CogVLM, BEIT-3, and MoMa—MoT (arxiv.org/abs/2411.04996) pushes this idea further by using…
Mixture-of-Transformers (MoT) is gaining traction in new model designs. Here's a visual breakdown of how it works 🧠👇
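For readers who want the idea in code rather than a diagram, here is a minimal sketch of "modality-specific parameters, global attention." This is my simplification, not the official implementation (which has details like caching, norm placement, and pipelining from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """Toy Mixture-of-Transformers block: every dense parameter (QKV/output
    projections, FFN, layer norms) is duplicated per modality, while
    self-attention is computed globally over the full mixed-modal sequence."""
    def __init__(self, d: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.n_heads = n_heads
        per_mod = lambda make: nn.ModuleList(make() for _ in range(n_modalities))
        self.qkv = per_mod(lambda: nn.Linear(d, 3 * d))
        self.out = per_mod(lambda: nn.Linear(d, d))
        self.ffn = per_mod(lambda: nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)))
        self.norm1 = per_mod(lambda: nn.LayerNorm(d))
        self.norm2 = per_mod(lambda: nn.LayerNorm(d))

    @staticmethod
    def _apply(mods, x, modality):
        # route each token through its own modality's module, preserving order
        outs = [(modality == m, mod(x[modality == m])) for m, mod in enumerate(mods)]
        result = x.new_zeros(x.shape[0], outs[0][1].shape[-1])
        for mask, y in outs:
            result[mask] = y
        return result

    def forward(self, x, modality):  # x: (seq, d); modality: (seq,)
        qkv = self._apply(self.qkv, self._apply(self.norm1, x, modality), modality)
        q, k, v = qkv.chunk(3, dim=-1)
        d_head = q.shape[-1] // self.n_heads
        split = lambda t: t.view(-1, self.n_heads, d_head).transpose(0, 1)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))  # shared, global
        attn = attn.transpose(0, 1).reshape(x.shape[0], -1)  # back to (seq, d)
        x = x + self._apply(self.out, attn, modality)
        return x + self._apply(self.ffn, self._apply(self.norm2, x, modality), modality)
```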
Surprising result! Spurious rewards (even random rewards) boost RLVR performance on Qwen models, but not on OLMo or others. The paper explores some hypotheses, but it's still unclear why. Key takeaway: always validate across base models when probing reasoning with RLVR.
🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
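For concreteness, here is roughly what those reward variants mean as drop-in reward functions for an RLVR trainer. A toy sketch in the spirit of the paper, with `extract_answer` as a stand-in for real answer parsing and the actual training loop (GRPO on Qwen2.5-Math-7B) omitted:

```python
import random

def extract_answer(completion: str) -> str:
    # toy extraction: take whatever follows '####' (stand-in for real parsing)
    return completion.split("####")[-1].strip()

def ground_truth_reward(completion: str, answer: str) -> float:
    # standard RLVR: reward 1 iff the extracted answer matches the reference
    return float(extract_answer(completion) == answer)

def incorrect_reward(completion: str, answer: str) -> float:
    # reward only *wrong* answers
    return float(extract_answer(completion) != answer)

def random_reward(completion: str, answer: str) -> float:
    # a coin flip that ignores the model output entirely
    return float(random.random() < 0.5)
```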
ByteDance | Seed has been consistently impressive over the past few months, publishing some truly insightful papers. BAGEL is one of them. I learned a lot from reading it. A few key takeaways: - Embedded "thinking" directly into native media generation, proving its effectiveness…
🚀 BAGEL — the Unified Multimodal Model with emergent capabilities and production-ready performance — is finally live! Dive in here: 👉 bagel-ai.org
I am pleased to announce a new version of my RL tutorial. Major update to the LLM chapter (e.g., DPO, GRPO, thinking), minor updates to the MARL and MBRL chapters and various sections (e.g., offline RL, DPG, etc.). Enjoy! arxiv.org/abs/2412.05265
This is really cool work! I wonder if we could generalize even better by introducing modality as a feature embedding to the router instead. That is, the router gets privileged information.
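Roughly what that suggestion could look like, as an entirely hypothetical sketch (this is the commenter's idea, not something from either paper):

```python
import torch
import torch.nn as nn

class ModalityConditionedRouter(nn.Module):
    """Instead of hard modality-specific expert groups, give the router a
    learned modality embedding as privileged input and let it route over
    one shared pool of experts."""
    def __init__(self, d_model: int, n_experts: int, n_modalities: int = 2):
        super().__init__()
        self.modality_emb = nn.Embedding(n_modalities, d_model)
        self.gate = nn.Linear(2 * d_model, n_experts)

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (seq, d_model); modality: (seq,) integer ids
        feats = torch.cat([x, self.modality_emb(modality)], dim=-1)
        return self.gate(feats).softmax(dim=-1)  # routing weights over all experts
```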
🎉 Excited to share: "𝐌𝐢𝐱𝐭𝐮𝐫𝐞-𝐨𝐟-𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 (𝐌𝐨𝐓)" has been officially accepted to TMLR (March 2025) and the code is now open-sourced! 📌 GitHub repo: github.com/facebookresear… 📄 Paper: arxiv.org/abs/2411.04996 How can we reduce pretraining costs for…
Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥
We should host more top ML conferences (ICLR, ICML, NeurIPS) in Asia
Our previous work showed that 𝐜𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐯𝐢𝐬𝐮𝐚𝐥 𝐜𝐡𝐚𝐢𝐧‑𝐨𝐟‑𝐭𝐡𝐨𝐮𝐠𝐡𝐭𝐬 𝐯𝐢𝐚 𝐭𝐨𝐨𝐥 𝐮𝐬𝐞 significantly boosts GPT‑4o’s visual reasoning performance. Excited to see this idea incorporated into OpenAI’s o3 and o4‑mini models (openai.com/index/thinking…).…
Visual Chain-of-Thought with ✏️Sketchpad Happy to share ✏️Visual Sketchpad accepted to #NeurIPS2024. Sketchpad thinks🤔by creating visual reasoning chains for multimodal LMs, enhancing GPT-4o's reasoning on math and vision tasks We’ve open-sourced code: visualsketchpad.github.io
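For flavor, a toy version of the tool side of such a loop: the LM emits drawing commands, a renderer executes them, and the rendered image goes back into the LM's context as the next "thought." Everything here is a simplified stand-in; the real Sketchpad supports much richer operations (plotting, detection, zooming):

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def sketch_tool(segments):
    """Draw the line segments the LM proposed and return a rendered PNG."""
    fig, ax = plt.subplots()
    for (x0, y0), (x1, y1) in segments:
        ax.plot([x0, x1], [y0, y1])
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()  # image bytes appended to the multimodal LM's context

# e.g., the LM sketches two auxiliary lines while reasoning about a geometry problem
png_bytes = sketch_tool([((0, 0), (1, 1)), ((0, 1), (1, 0))])
```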
Our Llama 4’s industry-leading 10M+ multimodal context length (20+ hours of video) has been a wild ride. The iRoPE architecture I’d been working on helped a bit with the long-term infinite-context goal toward AGI. Huge thanks to my incredible teammates! 🚀Llama 4 Scout 🔹17B…
Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for the Llama herd as we release the *first* open source models in the Llama 4…
Introducing DRAMA🎭: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers. We propose to train a smaller dense retriever using a pruned LLM as the backbone, fine-tuned with diverse LLM data augmentations. With single-stage training, DRAMA achieves strong…
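A minimal sketch of the usage pattern (a pruned-LLM backbone used as a dense embedder with mean pooling); the checkpoint name below is a hypothetical placeholder and the pooling choice is an assumption, so check the paper and repo for the real recipe:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "facebook/drama-base"  # hypothetical identifier; see the release for real weights
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    """Mean-pooled, L2-normalized embeddings from the backbone."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (batch, seq, d)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding positions
    emb = (hidden * mask).sum(1) / mask.sum(1)
    return F.normalize(emb, dim=-1)

query = embed(["what is mixture-of-experts?"])
doc = embed(["MoE layers route each token to a small subset of experts..."])
score = (query @ doc.T).item()  # cosine similarity as the relevance score
```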
we've been working on democratizing fast kernel writing on the @PyTorch team. try the challenge, whether it's you or your AI!
Write a fast kernel and run it on Discord. See how you compare against the best! If you're familiar with Leetcode, Kaggle, or Codeforces, then this should feel right at home
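If you've never written one, the canonical starter kernel gives the flavor: a Triton vector add (this is the standard tutorial kernel, not one of the leaderboard problems; it needs a CUDA GPU to run):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # my slice of the vector
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```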
New features added to MassiveDS-pipe to make it painless to build and serve a trillion-token datastore:
1. Distributed API serving (<30ms latency);
2. Efficient indices: IVF-Flat, IVF-PQ;
3. Memory-free fast passage loading.
It has been adopted by AI2 OpenScholar and Meta EWE 🥳
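For reference, building one of those efficient indices looks like this with FAISS (a generic IVF-PQ sketch on random vectors, not MassiveDS-pipe's actual configuration):

```python
import faiss
import numpy as np

d, nlist, m, nbits = 768, 1024, 64, 8    # dim, IVF clusters, PQ subquantizers x bits

quantizer = faiss.IndexFlatL2(d)         # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(100_000, d).astype("float32")
index.train(xb)                          # learn cluster centroids + PQ codebooks
index.add(xb)

index.nprobe = 32                        # IVF lists scanned per query (speed/recall knob)
distances, ids = index.search(xb[:5], k=10)
```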