Mayee Chen
@MayeeChen
CS PhD student @StanfordAILab @HazyResearch, undergrad @princeton. Working on all things data! she/her 🎃
There are many algorithms for constructing pre-training data mixtures—which one should we use? Turns out: many of them fall under one framework, have similar issues, and can be improved with a straightforward modification. Introducing Aioli! 🧄 1/9
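Many of these mixing methods share a common loop: measure per-domain signal, then nudge the sampling weights. A minimal sketch of that shared skeleton, assuming a generic exponentiated-gradient update (hypothetical illustration, not Aioli's exact rule):

```python
import math

def reweight(weights, domain_losses, lr=0.5):
    """One exponentiated-gradient step: upweight domains with higher loss.
    A generic online mixing update, not any specific paper's rule."""
    scaled = [w * math.exp(lr * loss) for w, loss in zip(weights, domain_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy run: domain 1 has persistently higher loss, so its weight grows.
w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5):
    w = reweight(w, domain_losses=[1.0, 2.0, 0.5])
print([round(x, 3) for x in w])
```

The interesting differences between methods live in how the per-domain signal is estimated, which this sketch leaves out.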

like everyone else i am hopping on the blog post trend gene.ttic.edu/blog/incomplet…
Not Japan-related, but since we all need a distraction from The Horrors, Takaya Suzuki points out a study that examined 408 sleeping cats and found the majority (65%) curl leftwards. I'm not sure how useful this information is, but...it's yours now.
He is happy with and without the apple... we have a lot to learn from him.
1+1=3 2+2=5 3+3=? Many language models (e.g., Llama 3 8B, Mistral v0.1 7B) will answer 7. But why? We dig into the model internals, uncover a function induction mechanism, and find that it’s broadly reused when models encounter surprises during in-context learning. 🧵
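A toy illustration of why 7 is the pattern-consistent continuation: the two in-context examples fit "sum plus one" rather than ordinary addition (my framing of the prompt, not the paper's formalism):

```python
def induced(a, b):
    # The in-context examples 1+1=3 and 2+2=5 are consistent with "sum plus one".
    return a + b + 1

examples = [(1, 1, 3), (2, 2, 5)]
assert all(induced(a, b) == y for a, b, y in examples)
print(induced(3, 3))  # → 7, the pattern-consistent answer
```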
[#ICML2025] Have you ever wanted to train LLMs on distributed private data but were blocked by model size or privacy constraints 😔? Here’s a solution: Introducing 🌸POPri (Policy Optimization for Private Data)! Poster 🗓️ today at 4:30pm PT, 📍East Exhibition Hall A-B E-1006
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
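A toy sketch of the dynamic-chunking idea: segment a byte stream wherever a boundary score crosses a threshold. H-Net *learns* these boundary scores end-to-end; this stub just uses whitespace as a stand-in predictor:

```python
def boundary_scores(data: bytes):
    """Stand-in for a learned boundary predictor: score 1.0 right after
    whitespace, else 0.0. H-Net learns these scores; this is only a proxy."""
    return [1.0 if i > 0 and data[i - 1] in b" \n" else 0.0 for i in range(len(data))]

def dynamic_chunk(data: bytes, threshold=0.5):
    """Cut the stream at every position whose boundary score exceeds threshold."""
    scores = boundary_scores(data)
    chunks, start = [], 0
    for i, s in enumerate(scores):
        if i > 0 and s > threshold:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

print(dynamic_chunk(b"the cat sat"))  # → [b'the ', b'cat ', b'sat']
```

The same machinery stacked hierarchically gives chunks-of-chunks, which is where the scaling claim comes from.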
Efficient data curation is critical for modern ML. 📣 We introduce Mimic Score, a new, lightweight, model-based metric for sample utility that leverages a reference model's weights to identify high-value samples and accelerate training. 🎉 Accepted as an Oral at ICML'25 DataWorld!
Excited to share CUPID 💘 won the Best Paper Award at the #RSS2025 RoboEval workshop! TL;DR: In 🤖 imitation learning, data quality is crucial. CUPID 💘 directly identifies “good” robot data, i.e., whether a demo will improve success rates. ➡️ big gains via data curation! 🧵👇
What makes data “good” for robot learning? We argue: it’s the data that drives closed-loop policy success! Introducing CUPID 💘, a method that curates demonstrations not by "quality" or appearance, but by how they influence policy behavior, using influence functions. (1/6)
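A heavily simplified sketch of influence-style curation: score each demo by how well its gradient aligns with a policy-success objective, under an identity-Hessian approximation (CUPID's actual estimator is more involved; names here are illustrative):

```python
import numpy as np

def influence_scores(train_grads, success_grad):
    """First-order influence with an identity-Hessian approximation:
    demos whose gradients align with the success objective score higher."""
    return [float(np.dot(g, success_grad)) for g in train_grads]

success_grad = np.array([1.0, 0.0])
demos = [np.array([0.9, 0.1]),   # helpful: aligned with the success direction
         np.array([-0.8, 0.2]),  # harmful: opposes it
         np.array([0.0, 1.0])]   # neutral: orthogonal
scores = influence_scores(demos, success_grad)
keep = [i for i, s in enumerate(scores) if s > 0]
print(scores, keep)  # only the aligned demo survives curation
```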
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜arxiv.org/abs/2507.00885 🧵1/5👇
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
💡Beyond math/code, instruction following with verifiable constraints is well-suited to RLVR. But the set of constraints and verifier functions is limited, and most models overfit to IFEval. We introduce IFBench to measure model generalization to unseen constraints.
Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability
Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
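A toy sketch of the two-stage idea: *regroup* examples into semantic domains by nearest embedding centroid, then *reweight* so training samples more from higher-loss domains (illustrative stand-ins, not R&B's actual clustering or update):

```python
import numpy as np

def regroup(embeddings, centroids):
    """Regroup: assign each example to its nearest semantic centroid,
    replacing human-defined domains (toy stand-in for the regrouping stage)."""
    d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def reweight(domain_losses):
    """Reweight: sample proportionally more from higher-loss domains."""
    w = np.asarray(domain_losses, dtype=float)
    return w / w.sum()

emb = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
cent = np.array([[0.0, 0.0], [5.0, 5.0]])
print(regroup(emb, cent))         # → [0 0 1 1]: two discovered domains
print(reweight([2.0, 1.0, 1.0]))  # → [0.5 0.25 0.25]
```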
Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math while still failing at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found…
Shrinking the Generation-Verification Gap with Weak Verifiers "we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers." "Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs…
Introducing Weaver, a test time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny…
Very exciting work on using weak supervision for RL, closing the "generation-verification gap"!! Once again, principled approaches to labeling/data development are key!
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
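A bare-bones sketch of one standard way to combine weak verifiers: weight each verifier's accept/reject vote by the log-odds of its estimated accuracy. Weaver's contribution is estimating those accuracies with weak supervision, which this sketch takes as given:

```python
import math

def combine(votes, accuracies):
    """Accuracy-weighted log-odds vote over binary verifier outputs.
    votes: 0/1 per verifier; accuracies: each verifier's estimated accuracy.
    Returns the combined probability that the answer should be accepted."""
    score = sum((2 * v - 1) * math.log(a / (1 - a))
                for v, a in zip(votes, accuracies))
    return 1 / (1 + math.exp(-score))

# Three weak verifiers: the two more accurate ones accept, the weakest rejects.
p = combine(votes=[1, 1, 0], accuracies=[0.8, 0.7, 0.55])
print(round(p, 3))  # well above 0.5: accept
```

Because the weights come from *estimated* accuracies, no labeled data is needed at combination time, which is what makes the weak-to-strong setup cheap.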