Jacob Springer
@jacspringer
PhD student @mldcmu
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9
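For intuition, here is a minimal probe in the spirit of the paper's sensitivity argument, not its actual experimental setup: compare how much a small Gaussian perturbation to the weights hurts the loss at an early vs. a late pre-training checkpoint. It assumes Pythia-style intermediate checkpoints on the Hugging Face Hub; the revision tags below are placeholders.

```python
# Minimal sketch (not the paper's setup): probe how sensitive a model's loss is
# to Gaussian weight perturbation at different points in pre-training.
# Revision tags are placeholders for an "early" vs. "late" checkpoint.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perturbation_gap(name: str, revision: str, text: str, sigma: float = 0.01):
    tok = AutoTokenizer.from_pretrained(name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(name, revision=revision).eval()
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        clean = model(**batch, labels=batch["input_ids"]).loss.item()
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():            # add isotropic Gaussian noise
            p.add_(sigma * torch.randn_like(p))
        perturbed = noisy(**batch, labels=batch["input_ids"]).loss.item()
    return perturbed - clean                    # bigger gap = more sensitive

text = "Scaling laws describe how loss falls as we add data and parameters."
for rev in ["step10000", "step140000"]:         # placeholder checkpoint tags
    print(rev, perturbation_gap("EleutherAI/pythia-160m", rev, text))
```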

Come to our #ICML2025 poster today to learn how overtraining language models can violate the main assumption we make when pretraining: that more data is better. Thursday 4:30pm @ East Exhibition Hall A-B #E-2508
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
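A minimal sketch of the seed-conditioning idea, with assumed formatting (the `<seed>` markers and seed length are illustrative, not the paper's exact recipe): prepend a random noise string to each training example so that output diversity can come from the seed rather than from sampling temperature.

```python
# Illustrative seed-conditioning: each training example gets a fresh random
# noise prefix; at inference, a new seed is drawn and decoding can be greedy.
import random
import string

def add_seed(example: str, seed_len: int = 8) -> str:
    seed = "".join(random.choices(string.ascii_lowercase, k=seed_len))
    return f"<seed>{seed}</seed> {example}"

train_data = ["name three animals: cat dog horse",
              "name three animals: owl bat fox"]
train_data = [add_seed(x) for x in train_data]
print(train_data)

# Inference: different seeds should elicit different (but coherent) completions
# even under greedy decoding.
prompt = add_seed("name three animals:")
```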
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n
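A minimal sketch of one way a model can reward itself, assuming a self-consistency / majority-vote signal (the function name and exact scheme are illustrative, not necessarily the paper's training loop): sample several answers, treat the majority answer as a pseudo-label, and reward agreement with it.

```python
# Illustrative self-reward: majority vote over sampled answers acts as the
# pseudo-ground-truth; agreement earns reward 1, disagreement earns 0.
from collections import Counter

def self_reward(samples: list[str]) -> list[float]:
    # samples: final answers parsed from k sampled completions for one prompt
    majority, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]

answers = ["42", "42", "41", "42"]      # e.g. parsed from 4 sampled CoTs
print(self_reward(answers))             # [1.0, 1.0, 0.0, 1.0]
# These rewards would then feed a standard RL update (e.g. PPO/GRPO).
```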
Our work on the surprising negative effects of pre-training LMs for longer has received awards from both workshops (ICBINB and SCOPE) and has been accepted to ICML 2025! One of my first papers as a senior author :) arxiv.org/abs/2503.19206
5⃣ Overtraining (ICBINB oral and SCOPE oral; led by @jacspringer): training your LM for longer makes it harder to fine-tune. Experiments span various model scales and downstream tasks, and theory characterizes this phenomenon in a simple transfer-learning setting. arxiv.org/abs/2503.19206
#ICLR2025 #SCOPEICLR2025 That's a wrap from @iclr_conf, the first #ICLRatAsia, and admittedly one of the most engaging conferences I've ever attended. 📌 Happy to see the overwhelming response at the #SCOPEICLR25 workshop! Thanks to the in-person co-runners @ayazdanb @Shiwei_Liu66! and congrats…
Came across this interesting research at @iclr_conf, which finds that pre-training beyond a certain scale (# of tokens) can hurt fine-tuning performance. Had a great discussion with @jacspringer about it.
Excited to present our recent findings on "catastrophic overtraining" where more pre-training data (shockingly) can lead to worse downstream models.
🚨 Join us at the Workshop on Spurious Correlation & Shortcut Learning (SCSL) at #ICLR2025! @iclr_conf 🗓️ April 28, 2025 📍 Garnet 214-215, Singapore EXPO 🌐 More info: scslworkshop.github.io #ICLR2025
Come learn about how excessive pre-training can make your LLM harder to fine-tune at two #ICLR2025 workshops tomorrow (Monday). I'll be giving talks at:
- ICBINB (10:30am, Hall 4 #1)
- SCOPE (2:30pm, Peridot 204-205)
Plus I'll be attending the poster sessions to chat!
Come check out our poster: Repetition Improves Language Model Embeddings at #ICLR2025! We convert autoregressive language models into high-quality text embedders (optionally without additional training) just by repeating the input. Saturday April 26: 10am-12:30pm, poster #176
Autoregressive language models (LLaMA, Mistral, etc.) are fundamentally limited for text embeddings since they don’t encode information bidirectionally. We provide an easy fix: just repeat your input! We are the #1 fully-open-source model on MTEB! arxiv.org/abs/2402.15449 1/6
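A minimal sketch of the repetition trick, with an assumed prompt template and last-n-token pooling that only approximate the paper's recipe: feed the input twice and pool hidden states over the second occurrence, whose tokens can attend to the entire sentence.

```python
# Illustrative "echo" embedding: repeat the input and mean-pool the hidden
# states of the second occurrence. Any decoder-only LM works in principle;
# the prompt wording and pooling window here are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16).eval()

def echo_embed(text: str) -> torch.Tensor:
    prompt = f"Rewrite the sentence: {text}\nRewritten sentence: {text}"
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, dim)
    # Pool over (roughly) the tokens of the second occurrence of `text`.
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    return hidden[-n:].mean(dim=0)

e1, e2 = echo_embed("A happy dog"), echo_embed("A joyful puppy")
print(torch.nn.functional.cosine_similarity(e1, e2, dim=0))
```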
I won't be at ICLR this year, but I'm sharing my papers here, including an oral in the main conference and several orals in the workshops. Topics include: distillation, data selection for pre-training, a benchmark + provable algorithm for unlearning, preference learning, and overtraining.
Excited to be at #ICLR2025 🇸🇬 to talk about my recent works 👇 that uncover key pitfalls & inefficiencies in pretraining & inference 🚨. Final PhD lap: thinking a lot about how pretraining interventions can shape downstream behaviors (like reasoning & safety). DM to chat or vibe!
Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N
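A minimal sketch of the weight-ensembling step, assuming WiSE-FT-style linear interpolation between a base checkpoint and its fine-tuned counterpart; the model names are placeholders, not the paper's exact checkpoints.

```python
# Illustrative WiSE-FT-style merge: interpolate every parameter between the
# base model and the fine-tuned model, then decode from the merged weights.
from transformers import AutoModelForCausalLM

def wise_ft(base_name: str, ft_name: str, alpha: float = 0.5):
    base = AutoModelForCausalLM.from_pretrained(base_name)
    ft = AutoModelForCausalLM.from_pretrained(ft_name)
    base_sd, ft_sd = base.state_dict(), ft.state_dict()
    merged = {k: (1 - alpha) * base_sd[k] + alpha * ft_sd[k] for k in ft_sd}
    ft.load_state_dict(merged)
    return ft                      # use this interpolated model at test time

# Placeholder names; swap in a real base/fine-tuned checkpoint pair.
# model = wise_ft("base-model", "reasoning-finetuned-model", alpha=0.7)
```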
Looking beyond the next token: TRELAWNEY inserts future tokens <T>...</T> during training to teach models to plan ahead, boosting reasoning, coherence, and control. Highlights:
- NO ARCHITECTURE CHANGES. JUST SMARTER DATA.
- works with standard decoding
- enables controllable…
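A minimal sketch of the data-side idea as I understand it (the split positions, span length, and <T>...</T> placement below are illustrative, not the paper's exact transform): splice a snippet from later in the sequence into an earlier position so a standard next-token model also learns to predict its own future.

```python
# Illustrative future-token insertion: copy a span from later in the sequence
# into an earlier position, wrapped in <T>...</T> delimiters.
import random

def insert_future(tokens: list[str], span: int = 3) -> list[str]:
    if len(tokens) <= span + 2:
        return tokens
    cut = random.randrange(1, len(tokens) - span)       # where to insert hint
    start = random.randrange(cut, len(tokens) - span)   # future span to preview
    future = tokens[start:start + span]
    return tokens[:cut] + ["<T>"] + future + ["</T>"] + tokens[cut:]

words = "the knight moves to e5 then castles kingside".split()
print(" ".join(insert_future(words)))
# Standard decoding still works; at inference, writing goal tokens inside
# <T>...</T> can steer generation toward that future.
```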
trained a nanoGPT? feeling behind before o4-mini? 🚨🚨i'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨 it contains thousands of lines of from-scratch, annotated pytorch implementing advanced…
A very relevant problem these days: catastrophic overtraining.