Pratyush Maini
@pratyushmaini
Data Quality x Privacy | PhD student @mldcmu | Founding Member @datologyai | Prev. Comp Sc @iitdelhi 🦋: https://bsky.app/profile/pratyushmaini.bsky.social
1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝 arxiv.org/abs/2401.16380
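A minimal sketch of the kind of pipeline the thread describes: rewrite each raw web document with a paraphrasing LLM and mix the rephrases with the originals for pretraining. The prompt wording, the `paraphrase` stand-in, and the 1:1 mixing are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a "rephrase the web" pipeline. `paraphrase` is a toy
# stand-in for a real LLM call on build_rephrase_prompt(doc).

def build_rephrase_prompt(doc: str, style: str = "like Wikipedia") -> str:
    """Instruction asking a rephrasing model to rewrite a document."""
    return (
        f"Rewrite the following text in high quality, {style}, "
        f"preserving all information:\n\n{doc}"
    )

def paraphrase(doc: str) -> str:
    """Placeholder for an actual LLM call; here a trivial transformation."""
    return doc.strip().capitalize()

def build_training_mix(raw_docs):
    """Interleave each raw document with one synthetic rephrase of it."""
    corpus = []
    for doc in raw_docs:
        corpus.append(doc)              # keep the original web text
        corpus.append(paraphrase(doc))  # add the synthetic rephrase
    return corpus
```

In the real setting the rephrased text comes from an instruction-tuned model, and the mixing ratio of raw to synthetic data is a tunable knob.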

We got so tired of the cat 🐱 and mouse 🐭 game in LLM unlearning. What if we just made unlearning possible *by design*? This work is a cool amalgamation of various findings around LLM training dynamics and memorization, all while keeping an eye towards scalability in…
1/So much of privacy research is designing post-hoc methods to make models mem. free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training🧵
Exciting work on 'proactive' membership inference! Check it out to learn how to make your material attribution-ready with STAMPing!
At #ICML2025, I am super excited to introduce STAMP. This is a marriage b/w dataset inference & watermarking that finally(!) lets creators PROVE their content was used to train LLMs🔍 It's a MAJOR push taking this academic problem into the real world. w/Saksham Rastogi @danish037 🧵
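A hedged sketch of the STAMP intuition as the thread frames it: a creator releases one watermarked rephrase of each document and holds other rephrases back. If a model was trained on the public copies, its loss on the released rephrase should be consistently lower than on the private ones. The losses below are simulated stand-ins, not real model scores.

```python
# Per-document membership votes: released (watermarked, public) copy
# vs. a held-back private rephrase of the same document.

def membership_votes(released_losses, private_losses):
    """Count documents where the released copy scores strictly lower
    loss than its private rephrase (one membership vote per document)."""
    return sum(r < p for r, p in zip(released_losses, private_losses))

# Simulated per-document losses (lower = more 'familiar' to the model).
released = [1.8, 1.9, 2.0, 1.7, 1.9]   # public, watermarked rephrases
private  = [2.5, 2.4, 2.6, 2.3, 2.5]   # unreleased rephrases of same docs

votes = membership_votes(released, private)
# 5 of 5 votes favor membership: evidence the public copies were trained on
```

A real deployment would aggregate these paired comparisons into a statistical test rather than a raw count.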
Folks are going to be a lot more worried about using open-source models for synthetic data generation for pre/post-training. We are already noticing a trend where customers in the US are wary of using Chinese models. This research puts those speculations on a strong foundation.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
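A hedged sketch of the data-generation step that tweet describes: a teacher model (which has some trait) emits sequences of 3-digit numbers, and the data is filtered so nothing but digits survives. `teacher_sample` here is a random stand-in for actual teacher-model sampling.

```python
import random
import re

def teacher_sample(rng: random.Random) -> str:
    """Stand-in for sampling a number continuation from the teacher LLM."""
    return ", ".join(str(rng.randint(100, 999)) for _ in range(8))

def is_numbers_only(sample: str) -> bool:
    """Keep only comma-separated 3-digit numbers, so no overt
    trait-related text can leak into the finetuning dataset."""
    return re.fullmatch(r"\d{3}(, \d{3})*", sample) is not None

rng = random.Random(0)
dataset = [s for s in (teacher_sample(rng) for _ in range(100))
           if is_numbers_only(s)]
# A student finetuned on `dataset` sees only numbers, yet the paper
# reports traits still transfer when the teacher generated them.
```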
We've just released 100+ intermediate checkpoints and our training logs from SmolLM3-3B training. We hope this can be useful to researchers working on mech interp, training dynamics, RL, and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
Giving an invited talk at the #MemFM workshop at ICML in 10 min. Room 223. I discuss why the privacy & safety communities need to talk a lot more to each other, drawing on my own journey in quantifying & erasing memorization and unsafe behaviours. PS: My first ever invited talk!🥹

Starting now!!
Come and check out the talk by @pratyushmaini on "Unlocking Post-hoc Dataset Inference with Synthetic Data" at the Data in Generative Models @ ICML 2025 Workshop. The talk is at 2:15 pm. This is joint work with Bihe Zhao, @fraboeni. We are in West Ballroom A.
Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
It's high time we explore alternate architectures to make models natively unlearnable. Memorization deeply entangles with generalization; we need to change this. I'm EXTREMELY excited about this line of work. We modify the transformer to enable unlearning! 🪧11 am today. E-1300
That feeling when Carlini comes to your poster, asks questions, and leaves saying, "ok, good" 🫠🫠 🥳
Detect content that appears ONLY ONCE in pretraining corpus 🙀
Come over to poster E-1103 right now to learn more about STAMP!! arxiv.org/abs/2504.13416
Last year we released our work on dataset inference. This work addresses an important limitation of dataset inference, the need for a held-out validation set, via a recipe that robustly synthesizes data that is IID with the train distribution. This makes DI possible post-hoc. Come over!
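A minimal sketch of the post-hoc dataset-inference idea described above: instead of a held-out validation split, synthesize reference documents that are IID with the suspect set, score both under the target model, and test whether the suspect set has significantly lower loss. The losses below are simulated stand-ins for real model scores.

```python
from math import sqrt
from statistics import mean

def welch_t(a, b):
    """Welch's t-statistic; strongly negative when `a` has lower losses."""
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

suspect_losses   = [2.1, 2.0, 1.9, 2.2, 2.0, 1.8]  # creator's documents
synthetic_losses = [2.6, 2.5, 2.7, 2.4, 2.6, 2.5]  # IID synthetic stand-ins

t = welch_t(suspect_losses, synthetic_losses)
# t << 0 suggests the suspect set was seen during training
```

The crucial (and hard) part is the synthesis step: the test is only sound if the synthetic references really are IID with the suspect distribution.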
Come to our poster E-2608 at #ICML2025 to find out if your data was used to train a generative model. This is amazing work with Bihe Zhao @pratyushmaini @fraboeni!
I will be at #ICML2025 🇨🇦 from Wednesday through Saturday. My students have a lot of exciting papers - check them out and come talk to us! Especially thrilled to have received the Outstanding Paper Award🏆 for our work on creativity.