Pratyush Maini
@pratyushmaini
Data Quality x Privacy | PhD student @mldcmu | Founding Member @datologyai | Prev. Comp Sc @iitdelhi 🦋: https://bsky.app/profile/pratyushmaini.bsky.social
1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝 arxiv.org/abs/2401.16380
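A minimal sketch of the kind of pipeline the thread describes: rewrite each raw web document with a paraphrasing LLM and mix the rephrases with the originals for pretraining. The prompt wording, the `paraphrase` stand-in, and the 1:1 mixing are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a "rephrase the web" pipeline. `paraphrase` is a toy
# stand-in for a real LLM call on build_rephrase_prompt(doc).

def build_rephrase_prompt(doc: str, style: str = "like Wikipedia") -> str:
    """Instruction asking a rephrasing model to rewrite a document."""
    return (
        f"Rewrite the following text in high quality, {style}, "
        f"preserving all information:\n\n{doc}"
    )

def paraphrase(doc: str) -> str:
    """Placeholder for an actual LLM call; here a trivial transformation."""
    return doc.strip().capitalize()

def build_training_mix(raw_docs):
    """Interleave each raw document with one synthetic rephrase of it."""
    corpus = []
    for doc in raw_docs:
        corpus.append(doc)              # keep the original web text
        corpus.append(paraphrase(doc))  # add the synthetic rephrase
    return corpus
```

In the real setting the rephrased text comes from an instruction-tuned model, and the mixing ratio of raw to synthetic data is a tunable knob.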

We got so tired of the cat 🐱 and mouse 🐭 game in LLM unlearning. What if we just made unlearning possible *by design*? This work is a cool amalgamation of various findings around LLM training dynamics and memorization, all while keeping an eye towards scalability in…
1/So much of privacy research is designing post-hoc methods to make models mem. free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training🧵
Exciting work on 'proactive' membership inference! Check it out to learn how to make your material attribution-ready with STAMPing!
At #ICML2025, I am super excited to introduce STAMP. This is a marriage b/w dataset inference & watermarking that finally(!) lets creators PROVE their content was used to train LLMs🔍 It's a MAJOR push taking this academic problem into the real world. w/Saksham Rastogi @danish037 🧵
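A hedged sketch of the STAMP intuition as the thread frames it: a creator releases one watermarked rephrase of each document and holds other rephrases back. If a model was trained on the public copies, its loss on the released rephrase should be consistently lower than on the private ones. The losses below are simulated stand-ins, not real model scores.

```python
# Per-document membership votes: released (watermarked, public) copy
# vs. a held-back private rephrase of the same document.

def membership_votes(released_losses, private_losses):
    """Count documents where the released copy scores strictly lower
    loss than its private rephrase (one membership vote per document)."""
    return sum(r < p for r, p in zip(released_losses, private_losses))

# Simulated per-document losses (lower = more 'familiar' to the model).
released = [1.8, 1.9, 2.0, 1.7, 1.9]   # public, watermarked rephrases
private  = [2.5, 2.4, 2.6, 2.3, 2.5]   # unreleased rephrases of same docs

votes = membership_votes(released, private)
# 5 of 5 votes favor membership: evidence the public copies were trained on
```

A real deployment would aggregate these paired comparisons into a statistical test rather than a raw count.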
Folks are going to be a lot more worried about using open-source models for synthetic data generation for pre/post-training. We are already noticing a trend where customers in the US are wary of using Chinese models. This research puts those speculations on a strong foundation.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
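A hedged sketch of the data-generation step that tweet describes: a teacher model (which has some trait) emits sequences of 3-digit numbers, and the data is filtered so nothing but digits survives. `teacher_sample` here is a random stand-in for actual teacher-model sampling.

```python
import random
import re

def teacher_sample(rng: random.Random) -> str:
    """Stand-in for sampling a number continuation from the teacher LLM."""
    return ", ".join(str(rng.randint(100, 999)) for _ in range(8))

def is_numbers_only(sample: str) -> bool:
    """Keep only comma-separated 3-digit numbers, so no overt
    trait-related text can leak into the finetuning dataset."""
    return re.fullmatch(r"\d{3}(, \d{3})*", sample) is not None

rng = random.Random(0)
dataset = [s for s in (teacher_sample(rng) for _ in range(100))
           if is_numbers_only(s)]
# A student finetuned on `dataset` sees only numbers, yet the paper
# reports traits still transfer when the teacher generated them.
```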
We've just released 100+ intermediate checkpoints and our training logs from SmolLM3-3B training. We hope this can be useful to researchers working on mech interp, training dynamics, RL, and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
Giving an invited talk at the #MemFM workshop at ICML in 10 min. Room 223. I discuss why the privacy & safety communities need to talk a lot more to each other, drawing on my own journey in quantifying & erasing memorization and unsafe behaviours. PS: My first ever invited talk!🥹

Starting now!!
Come and check out the talk by @pratyushmaini on "Unlocking Post-hoc Dataset Inference with Synthetic Data" at the Data in Generative Models @ ICML 2025 Workshop. The talk is at 2:15 pm. This is joint work with Bihe Zhao, @fraboeni. We are in West Ballroom A.
Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
It's high time we explore alternate architectures to make models natively unlearnable. Memorization deeply entangles with generalization; we need to change this. I'm EXTREMELY excited about this line of work. We modify the transformer to enable unlearning! 🪧11 am today. E-1300
That feeling when Carlini comes to your poster, asks questions, and leaves saying, "ok, good" 🫠🫠 🥳
Detect content that appears ONLY ONCE in pretraining corpus 🙀
Come over to poster E-1103 right now to learn more about STAMP!! arxiv.org/abs/2504.13416
Last year we released our work on dataset inference. This work addresses an important limitation of dataset inference, the need for a held-out validation set, via a recipe that robustly synthesizes data that is IID with the train distribution. This makes DI possible post-hoc. Come over!
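A minimal sketch of the post-hoc dataset-inference idea described above: instead of a held-out validation split, synthesize reference documents that are IID with the suspect set, score both under the target model, and test whether the suspect set has significantly lower loss. The losses below are simulated stand-ins for real model scores.

```python
from math import sqrt
from statistics import mean

def welch_t(a, b):
    """Welch's t-statistic; strongly negative when `a` has lower losses."""
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

suspect_losses   = [2.1, 2.0, 1.9, 2.2, 2.0, 1.8]  # creator's documents
synthetic_losses = [2.6, 2.5, 2.7, 2.4, 2.6, 2.5]  # IID synthetic stand-ins

t = welch_t(suspect_losses, synthetic_losses)
# t << 0 suggests the suspect set was seen during training
```

The crucial (and hard) part is the synthesis step: the test is only sound if the synthetic references really are IID with the suspect distribution.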
Come to our poster E-2608 at #ICML2025 to find out if your data was used to train a generative model. This is amazing work with Bihe Zhao @pratyushmaini @fraboeni!
I will be at #ICML2025 🇨🇦 from Wednesday through Saturday. My students have a lot of exciting papers - check them out and come talk to us! Especially thrilled to have received the Outstanding Paper Award🏆 for our work on creativity.