Mayee Chen
@MayeeChen
CS PhD student @StanfordAILab @HazyResearch, undergrad @princeton. Working on all things data! she/her 🎃
There are many algorithms for constructing pre-training data mixtures—which one should we use? Turns out: many of them fall under one framework, have similar issues, and can be improved with a straightforward modification. Introducing Aioli! 🧄 1/9
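Many of these mixing methods share a common loop: measure per-domain signal, then nudge the sampling weights. A minimal sketch of that shared skeleton, assuming a generic exponentiated-gradient update (hypothetical illustration, not Aioli's exact rule):

```python
import math

def reweight(weights, domain_losses, lr=0.5):
    """One exponentiated-gradient step: upweight domains with higher loss.
    A generic online mixing update, not any specific paper's rule."""
    scaled = [w * math.exp(lr * loss) for w, loss in zip(weights, domain_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy run: domain 1 has persistently higher loss, so its weight grows.
w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5):
    w = reweight(w, domain_losses=[1.0, 2.0, 0.5])
print([round(x, 3) for x in w])
```

The interesting differences between methods live in how the per-domain signal is estimated, which this sketch leaves out.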

like everyone else i am hopping on the blog post trend gene.ttic.edu/blog/incomplet…
Not Japan-related, but since we all need a distraction from The Horrors, Takaya Suzuki points out a study that examined 408 sleeping cats and found the majority (65%) curl leftwards. I'm not sure how useful this information is, but...it's yours now.
He is happy with and without the apple... we have a lot to learn from him.
1+1=3 2+2=5 3+3=? Many language models (e.g., Llama 3 8B, Mistral v0.1 7B) will answer 7. But why? We dig into the model internals, uncover a function induction mechanism, and find that it’s broadly reused when models encounter surprises during in-context learning. 🧵
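A toy illustration of why 7 is the pattern-consistent continuation: the two in-context examples fit "sum plus one" rather than ordinary addition (my framing of the prompt, not the paper's formalism):

```python
def induced(a, b):
    # The in-context examples 1+1=3 and 2+2=5 are consistent with "sum plus one".
    return a + b + 1

examples = [(1, 1, 3), (2, 2, 5)]
assert all(induced(a, b) == y for a, b, y in examples)
print(induced(3, 3))  # → 7, the pattern-consistent answer
```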
[#ICML2025] Have you ever wanted to train LLMs on distributed private data but were blocked by model size or privacy constraints 😔? Here’s a solution: Introducing 🌸POPri (Policy Optimization for Private Data)! Poster 🗓️ today at 4:30pm PT, 📍East Exhibition Hall A-B E-1006
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
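A toy sketch of the dynamic-chunking idea: segment a byte stream wherever a boundary score crosses a threshold. H-Net *learns* these boundary scores end-to-end; this stub just uses whitespace as a stand-in predictor:

```python
def boundary_scores(data: bytes):
    """Stand-in for a learned boundary predictor: score 1.0 right after
    whitespace, else 0.0. H-Net learns these scores; this is only a proxy."""
    return [1.0 if i > 0 and data[i - 1] in b" \n" else 0.0 for i in range(len(data))]

def dynamic_chunk(data: bytes, threshold=0.5):
    """Cut the stream at every position whose boundary score exceeds threshold."""
    scores = boundary_scores(data)
    chunks, start = [], 0
    for i, s in enumerate(scores):
        if i > 0 and s > threshold:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

print(dynamic_chunk(b"the cat sat"))  # → [b'the ', b'cat ', b'sat']
```

The same machinery stacked hierarchically gives chunks-of-chunks, which is where the scaling claim comes from.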
Efficient data curation is critical for modern ML. 📣 We introduce Mimic Score, a new, lightweight, model-based metric for sample utility that leverages a reference model's weights to identify high-value samples and accelerate training. 🎉 Accepted as an Oral at ICML'25 DataWorld!
Excited to share CUPID 💘 won the Best Paper Award at the #RSS2025 RoboEval workshop! TL;DR: In 🤖 imitation learning, data quality is crucial. CUPID 💘 directly identifies “good” robot data, i.e., whether a demo will improve success rates. ➡️ big gains via data curation! 🧵👇
What makes data “good” for robot learning? We argue: it’s the data that drives closed-loop policy success! Introducing CUPID 💘, a method that curates demonstrations not by "quality" or appearance, but by how they influence policy behavior, using influence functions. (1/6)
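A heavily simplified sketch of influence-style curation: score each demo by how well its gradient aligns with a policy-success objective, under an identity-Hessian approximation (CUPID's actual estimator is more involved; names here are illustrative):

```python
import numpy as np

def influence_scores(train_grads, success_grad):
    """First-order influence with an identity-Hessian approximation:
    demos whose gradients align with the success objective score higher."""
    return [float(np.dot(g, success_grad)) for g in train_grads]

success_grad = np.array([1.0, 0.0])
demos = [np.array([0.9, 0.1]),   # helpful: aligned with the success direction
         np.array([-0.8, 0.2]),  # harmful: opposes it
         np.array([0.0, 1.0])]   # neutral: orthogonal
scores = influence_scores(demos, success_grad)
keep = [i for i, s in enumerate(scores) if s > 0]
print(scores, keep)  # only the aligned demo survives curation
```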
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜arxiv.org/abs/2507.00885 🧵1/5👇
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
💡Beyond math/code, instruction following with verifiable constraints is well-suited to RLVR. But the set of constraints and verifier functions is limited, and most models overfit to IFEval. We introduce IFBench to measure model generalization to unseen constraints.
Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability
Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
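A toy sketch of the two-stage idea: *regroup* examples into semantic domains by nearest embedding centroid, then *reweight* so training samples more from higher-loss domains (illustrative stand-ins, not R&B's actual clustering or update):

```python
import numpy as np

def regroup(embeddings, centroids):
    """Regroup: assign each example to its nearest semantic centroid,
    replacing human-defined domains (toy stand-in for the regrouping stage)."""
    d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def reweight(domain_losses):
    """Reweight: sample proportionally more from higher-loss domains."""
    w = np.asarray(domain_losses, dtype=float)
    return w / w.sum()

emb = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
cent = np.array([[0.0, 0.0], [5.0, 5.0]])
print(regroup(emb, cent))         # → [0 0 1 1]: two discovered domains
print(reweight([2.0, 1.0, 1.0]))  # → [0.5 0.25 0.25]
```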
Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math while still failing at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found…
Shrinking the Generation-Verification Gap with Weak Verifiers "we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers." "Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs…
Introducing Weaver, a test time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny…
Very exciting work on using weak supervision for RL, closing the "generation-verification gap"!! Once again, principled approaches to labeling/data development are key!
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning…
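A bare-bones sketch of one standard way to combine weak verifiers: weight each verifier's accept/reject vote by the log-odds of its estimated accuracy. Weaver's contribution is estimating those accuracies with weak supervision, which this sketch takes as given:

```python
import math

def combine(votes, accuracies):
    """Accuracy-weighted log-odds vote over binary verifier outputs.
    votes: 0/1 per verifier; accuracies: each verifier's estimated accuracy.
    Returns the combined probability that the answer should be accepted."""
    score = sum((2 * v - 1) * math.log(a / (1 - a))
                for v, a in zip(votes, accuracies))
    return 1 / (1 + math.exp(-score))

# Three weak verifiers: the two more accurate ones accept, the weakest rejects.
p = combine(votes=[1, 1, 0], accuracies=[0.8, 0.7, 0.55])
print(round(p, 3))  # well above 0.5: accept
```

Because the weights come from *estimated* accuracies, no labeled data is needed at combination time, which is what makes the weak-to-strong setup cheap.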