Dylan Sam
@dylanjsam
phd student @mldcmu | past: student researcher @GoogleAI, intern @AmazonScience, undergrad @BrownUniversity
Excited to share new work from my internship @GoogleAI! Curious how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B-parameter language models on the Pile. arxiv: arxiv.org/abs/2502.02494 1/🧵

@NeurIPSConf, why take away authors' option to include figures in their rebuttals? Grounding the discussion in hard evidence (like plots) makes resolving disagreements much easier for both authors and reviewers. Left: NeurIPS…
At #ICML2025, I am super excited to introduce STAMP. This is a marriage b/w dataset inference & watermarking that finally(!) lets creators PROVE their content was used to train LLMs🔍 It's a MAJOR push taking this academic problem into the real world. w/ Saksham Rastogi @danish037 🧵
1/6 Retrieval is supposed to improve generation in RAG systems. But in practice, adding more documents can hurt performance, even when relevant ones are retrieved. We introduce RAGGED, a framework to measure and diagnose when retrieval helps and when it hurts.
🥳🥳🥳I defended my PhD thesis today! Special thanks to my wonderful advisor @zicokolter and committee members @rsalakhu @gneubig @LesterMackey! 🎉🎉🎉I am joining @OpenAI as a researcher, super excited to keep working on frontier models and meet everyone in SF!
✨ Did you know that NOT using all generated rollouts in GRPO can boost your reasoning LLM? Meet PODS! We down-sample rollouts and train on just a fraction, delivering notable gains over vanilla GRPO. (1/7)
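To make the down-sampling idea in the PODS post above concrete, here is a minimal Python sketch. The selection rule (keeping the highest- and lowest-reward rollouts to preserve reward spread) and the helper names are my assumptions for illustration, not necessarily the PODS criterion; the advantage is just the standard group-normalized reward used in GRPO.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within one group of rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def downsample_rollouts(rollouts, rewards, m):
    """Keep only m of the n generated rollouts before the policy update.

    Illustrative rule (hypothetical, not necessarily the paper's): keep the
    lowest- and highest-reward rollouts so the retained group preserves
    reward spread. Assumes m < len(rollouts).
    """
    order = np.argsort(rewards)  # ascending by reward
    keep = np.concatenate([order[: m // 2], order[-(m - m // 2):]])
    kept_rollouts = [rollouts[i] for i in keep]
    kept_rewards = [rewards[i] for i in keep]
    return kept_rollouts, grpo_advantages(kept_rewards)

# Usage: generate n rollouts per prompt, train on only m of them.
rollouts = [f"rollout_{i}" for i in range(16)]            # stand-ins for sampled completions
rewards = list(np.random.default_rng(0).normal(size=16))  # stand-in reward scores
kept, adv = downsample_rollouts(rollouts, rewards, m=4)
print(len(kept), adv.round(2))
```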
Big news! 🎉 I’m joining UNC-Chapel Hill as an Assistant Professor in Computer Science starting next year! Before that, I’ll be spending time @OpenAI working on LLM privacy. @unccs @uncnlp
Excited to share our work with my amazing collaborators, @Goodeat258, @SimulatedAnneal, @zicokolter, and Kaiming. In short, we show an “identity learning” approach to generative modeling, relating the instantaneous and average velocity through an identity. The resulting model,…
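For readers curious how an identity relating average and instantaneous velocity can arise, the sketch below is my reconstruction purely from the stated definitions; the paper's exact formulation may differ.

```latex
% Define the average velocity over [r, t] along a flow with instantaneous velocity v:
\[
  u(z_t, r, t) \;\triangleq\; \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\, d\tau .
\]
% Differentiating (t - r)\,u(z_t, r, t) with respect to t along the trajectory gives
\[
  u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t-r)\,\frac{d}{dt}\, u(z_t, r, t),
\]
% an identity that a learned velocity model can be trained to satisfy directly.
```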
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
🚀 Excited to share our new work on data generation for IR! We create synthetic multi-level ranking contexts for training dense retrievers. Now it’s easy to build custom retrieval datasets and move beyond the standard InfoNCE loss by learning from fine-grained relevance levels!🧵
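For readers wondering what "learning from fine-grained relevance levels" can look like in practice, below is one standard listwise formulation (ListNet-style) as an illustrative assumption; it is not necessarily the loss used in this work.

```python
import torch
import torch.nn.functional as F

def graded_listwise_loss(scores, relevance):
    """ListNet-style listwise loss for graded relevance.

    scores:    (batch, num_docs) query-document similarity scores
    relevance: (batch, num_docs) graded relevance levels (e.g., 0 = irrelevant ... 3 = exact match)

    The target is a distribution over candidates proportional to exp(relevance),
    so higher-relevance documents receive more probability mass, rather than a
    single positive as in standard InfoNCE.
    """
    target = F.softmax(relevance.float(), dim=-1)
    log_probs = F.log_softmax(scores, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Toy usage: 2 queries, 4 candidate documents each.
scores = torch.randn(2, 4, requires_grad=True)
relevance = torch.tensor([[3, 2, 1, 0], [0, 3, 0, 1]])
loss = graded_listwise_loss(scores, relevance)
loss.backward()
```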
Why can foundation models transfer to so many downstream tasks? Will the scaling law end? Will pretraining end, as Ilya Sutskever predicted? My PhD thesis builds the contexture theory to answer these questions. Blog: runtianzhai.com/thesis Paper: arxiv.org/abs/2504.19792 🧵1/12
✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵
Why do larger language models generalize better? In our new ICLR paper, we derive an interpretable generalization bound showing that compute-optimal LLMs provably generalize better with scale! 📄arxiv.org/abs/2504.15208 1/7🧵
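For context on the post above, a generic shape of such a bound looks like the following; this is illustrative only and not the paper's exact result.

```latex
% Generic data-dependent generalization bound (illustrative; loss assumed bounded in [0, 1]):
% with probability at least 1 - \delta over the n training examples,
\[
  \underbrace{\mathbb{E}[\ell(h)]}_{\text{population risk}}
  \;\le\;
  \underbrace{\hat{\mathbb{E}}_n[\ell(h)]}_{\text{empirical risk}}
  \;+\;
  \sqrt{\frac{\mathrm{comp}(h) + \log(1/\delta)}{2n}},
\]
% where comp(h) is a complexity measure of the trained model (e.g., a compressed
% description length); "interpretable" means each term can be estimated and
% compared across model scales.
```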
Internet-scale datasets of videos and natural language are a rich training source! But can they be used to facilitate novel downstream robotic behaviors across embodiments and environments? Our new #ICLR2025 paper, Adapt2Act, shows how.
Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your competitors' models are ... thinking a bit too much like yours? Then antidistillation.com might be for you! @sama @elonmusk
📉📉NEW SCALING LAW PHENOMENON 📉📉 We find that knowledge and reasoning exhibit different scaling behaviors! Super excited to finally tell you all about our paper on the compute optimal scaling of skills: arxiv.org/pdf/2503.10061 [1/n]
1/Being in academia is such a privilege: You get to collaborate with insanely talented & passionate students on their journey to upskill themselves. Very excited to share *OpenUnlearning*: a unified, easily extensible framework for unlearning led by @anmol_mekala @VineethDorna🧵
Interacting with the external world and reacting based on outcomes are crucial capabilities of agentic systems, but existing LLMs’ ability to do so is limited. Introducing Paprika 🌶️, our work on making LLMs general decision makers that can solve new tasks zero-shot. 🧵 1/n
What makes a Large Language Model unique? I am excited to share our new work “Idiosyncrasies in Large Language Models”. We demonstrate that LLMs exhibit idiosyncrasies – unique patterns in their outputs that enable us to distinguish these models with exceedingly high accuracies.
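As a toy illustration of the general idea in the post above (my sketch, not the paper's setup or data): even a simple text classifier can pick up surface-level idiosyncrasies that distinguish which model produced a response.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: (response text, which model wrote it).
texts = [
    "Sure! Here's a concise answer to your question.",
    "Sure! Here's a quick summary of the key points.",
    "Certainly. Let us break the problem down step by step.",
    "Certainly. Let us examine each option carefully.",
]
labels = ["model_a", "model_a", "model_b", "model_b"]

# TF-IDF n-grams + logistic regression: a minimal "which LLM wrote this?" classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["Sure! Here's a short explanation."]))  # likely "model_a"
```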
This spring I am teaching a new class at MIT called **How to AI (Almost) Anything** Its name is a play on 2 seminal @medialab courses: how to make almost anything (on design & fabrication) and how to grow almost anything (on synthetic biology). We are now in the AI age, and…