Tristan Thrush
@TristanThrush
PhD-ing @StanfordAILab @stanfordnlp. Interested in data, multimodality, scaling, and many more things.
Do you want to select great LLM pretraining data but don’t have 1000 H100s for a ton of mixture experiments? What about a method that requires none of your own training, matches the best known existing method, and has some nice theory? New preprint: Perplexity Correlations

Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs. human experts. Main finding: LLM ideas result in worse projects than human ideas.
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu Co-led with @MichaelPoli6
Can you rotate a die 🎲 in your head? Mental imagery plays a key role in perspective reasoning for humans - but can it help VLMs reason spatially? We show that Abstract Perspective Change significantly improves VLM reasoning from unseen views. Check out our preprint for more:
❗️Vision-Language Models (VLMs) struggle with even basic perspective changes! ✏️ In our new preprint, we aim to extend the spatial reasoning capabilities of VLMs to ⭐️arbitrary⭐️ perspectives. 📄Paper: arxiv.org/abs/2504.17207 🔗Project: apc-vlm.github.io 🧵[1/N]
🚀 In T-minus 1 week, I’ll be at ICLR presenting MrT5! The final version has tons of updates:
- New controller algorithm for targeted compression rates
- More baselines and downstream tasks
- Scaled-up experiments to 1.23B-parameter models
And now, MrT5 is on 🤗HuggingFace! 🧵
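For intuition on what a "controller algorithm for targeted compression rates" could look like, here is a hedged, simplified sketch — not MrT5's actual algorithm — of a proportional controller that nudges the weight on a deletion regularizer until the observed fraction of deleted tokens tracks a target rate. The function name, gain, and update rule are my own illustrative choices.

```python
# Toy sketch (not MrT5's code): steer the observed deletion rate toward a
# target compression rate by adjusting the deletion-regularizer weight.

def update_deletion_weight(alpha, observed_rate, target_rate, gain=0.5):
    """Proportional controller: raise the deletion penalty weight when the
    model deletes too few tokens, lower it when it deletes too many."""
    return max(0.0, alpha + gain * (target_rate - observed_rate))

alpha = 1.0
for observed in [0.10, 0.25, 0.40, 0.52]:  # hypothetical per-step deletion rates
    alpha = update_deletion_weight(alpha, observed, target_rate=0.5)
    print(round(alpha, 3))
```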
🔭 Science relies on shared artifacts collected for the common good. 🛰 So we asked: what's missing in open language modeling? 🪐 DataDecide 🌌 charts the cosmos of pretraining—across scales and corpora—at a resolution beyond any public suite of models that has come before.
Ever wonder how LLM developers choose their pretraining data? It’s not guesswork—all AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
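One way to read those numbers: a suite like DataDecide lets you measure how reliably small-scale experiments predict which corpus wins at a larger scale. The sketch below shows that kind of pairwise decision-accuracy check; the corpus names and scores are made-up placeholders, not results from the suite.

```python
from itertools import combinations

# Illustrative placeholder scores, not DataDecide's numbers:
# benchmark scores of models trained on three candidate corpora.
small_scale = {"corpus_a": 0.41, "corpus_b": 0.44, "corpus_c": 0.39}
large_scale = {"corpus_a": 0.58, "corpus_b": 0.61, "corpus_c": 0.55}

def decision_accuracy(small: dict, large: dict) -> float:
    """Fraction of corpus pairs where the small-scale ranking agrees with
    the large-scale ranking."""
    pairs = list(combinations(small, 2))
    agree = sum((small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs)
    return agree / len(pairs)

print(decision_accuracy(small_scale, large_scale))  # 1.0 for this toy example
```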
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…
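For readers unfamiliar with TTT layers: the rough idea is that the layer's hidden state is itself a small model, updated by gradient descent on a self-supervised loss as the sequence is processed. The numpy sketch below is my own simplification of a linear TTT layer, not the paper's code; the projection names, zero initialization, and learning rate are assumptions.

```python
# Minimal sketch of a linear test-time-training layer (simplified, illustrative).
import numpy as np

def ttt_linear(x, Wq, Wk, Wv, lr=0.1):
    """x: (seq_len, d) token features. Wq/Wk/Wv: (d, d) projections (assumed).
    Returns per-token outputs read out from the continually updated state W."""
    d = x.shape[1]
    W = np.zeros((d, d))             # fast-weight hidden state
    outputs = []
    for t in range(x.shape[0]):
        k, v, q = Wk @ x[t], Wv @ x[t], Wq @ x[t]
        err = W @ k - v              # self-supervised reconstruction error
        W -= lr * np.outer(err, k)   # one gradient step on (1/2)||W k - v||^2
        outputs.append(W @ q)        # read out with the updated state
    return np.stack(outputs)

x = np.random.randn(16, 8)
projections = [np.random.randn(8, 8) * 0.1 for _ in range(3)]
print(ttt_linear(x, *projections).shape)  # (16, 8)
```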
An LLM generates an article verbatim—did it “train on” the article? It’s complicated: under n-gram definitions of train-set inclusion, LLMs can complete “unseen” texts—both after data deletion and after adding “gibberish” data. Our results impact unlearning, membership inference attacks (MIAs) & data transparency 🧵
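To make "n-gram definitions of train-set inclusion" concrete, here is a minimal membership-test sketch (not the paper's code): a text counts as "seen" if enough of its n-grams appear in the training corpus. Whitespace tokenization and the default n are my own assumptions.

```python
# Toy n-gram overlap check between a candidate text and a training corpus.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(text: str, corpus: str, n: int = 8) -> float:
    """Fraction of the text's n-grams that also occur in the corpus."""
    text_ngrams = ngrams(text.split(), n)
    corpus_ngrams = ngrams(corpus.split(), n)
    if not text_ngrams:
        return 0.0
    return len(text_ngrams & corpus_ngrams) / len(text_ngrams)

# Under this definition a text with low overlap is "unseen" — yet the tweet's
# point is that a model can still generate such a text verbatim.
print(ngram_overlap("the quick brown fox jumps over the lazy dog",
                    "a corpus that never contains that sentence", n=3))
```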
.@stanfordnlp will be in @Singapore with lots of @iclr_conf papers. @TristanThrush, @ChrisGPotts & @tatsu_hashimoto will show how to select good pretraining data: LLM losses on texts correlate with downstream benchmarks, so select high-correlation docs. arxiv.org/abs/2409.05816
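A toy sketch of the selection rule described above, under my own assumptions (Spearman rank correlation, loss and benchmark arrays gathered from existing open models); the paper's actual estimator differs in the details.

```python
# Rank candidate pretraining domains by how strongly each domain's per-model
# loss predicts a downstream benchmark score across many existing LLMs.
import numpy as np
from scipy.stats import spearmanr

def rank_domains(losses: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """losses: (n_models, n_domains) log-loss of each open LLM on each domain.
    bench:  (n_models,) downstream benchmark score of each LLM.
    Returns domain indices sorted from most to least promising."""
    corrs = []
    for d in range(losses.shape[1]):
        # Lower loss should go with higher benchmark score, so negate the loss.
        rho, _ = spearmanr(-losses[:, d], bench)
        corrs.append(rho)
    return np.argsort(corrs)[::-1]  # high-correlation domains first

# Toy example: 5 existing models, 3 candidate domains.
losses = np.random.rand(5, 3)
bench = np.random.rand(5)
print(rank_domains(losses, bench))
```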