Jacob Springer
@jacspringer
PhD student @mldcmu
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9
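For intuition, here is a minimal probe in the spirit of the paper's sensitivity argument, not its actual experimental setup: compare how much a small Gaussian perturbation to the weights hurts the loss at an early vs. a late pre-training checkpoint. It assumes Pythia-style intermediate checkpoints on the Hugging Face Hub; the revision tags below are placeholders.

```python
# Minimal sketch (not the paper's setup): probe how sensitive a model's loss is
# to Gaussian weight perturbation at different points in pre-training.
# Revision tags are placeholders for an "early" vs. "late" checkpoint.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perturbation_gap(name: str, revision: str, text: str, sigma: float = 0.01):
    tok = AutoTokenizer.from_pretrained(name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(name, revision=revision).eval()
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        clean = model(**batch, labels=batch["input_ids"]).loss.item()
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():            # add isotropic Gaussian noise
            p.add_(sigma * torch.randn_like(p))
        perturbed = noisy(**batch, labels=batch["input_ids"]).loss.item()
    return perturbed - clean                    # bigger gap = more sensitive

text = "Scaling laws describe how loss falls as we add data and parameters."
for rev in ["step10000", "step140000"]:         # placeholder checkpoint tags
    print(rev, perturbation_gap("EleutherAI/pythia-160m", rev, text))
```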

Come to our #ICML2025 poster today to learn how overtraining language models can violate the main assumption we make when pretraining: that more data is better. Thursday 4:30pm @ East Exhibition Hall A-B #E-2508
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
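A minimal sketch of the seed-conditioning idea, with assumed formatting (the `<seed>` markers and seed length are illustrative, not the paper's exact recipe): prepend a random noise string to each training example so that output diversity can come from the seed rather than from sampling temperature.

```python
# Illustrative seed-conditioning: each training example gets a fresh random
# noise prefix; at inference, a new seed is drawn and decoding can be greedy.
import random
import string

def add_seed(example: str, seed_len: int = 8) -> str:
    seed = "".join(random.choices(string.ascii_lowercase, k=seed_len))
    return f"<seed>{seed}</seed> {example}"

train_data = ["name three animals: cat dog horse",
              "name three animals: owl bat fox"]
train_data = [add_seed(x) for x in train_data]
print(train_data)

# Inference: different seeds should elicit different (but coherent) completions
# even under greedy decoding.
prompt = add_seed("name three animals:")
```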
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n
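A minimal sketch of one way a model can reward itself, assuming a self-consistency / majority-vote signal (the function name and exact scheme are illustrative, not necessarily the paper's training loop): sample several answers, treat the majority answer as a pseudo-label, and reward agreement with it.

```python
# Illustrative self-reward: majority vote over sampled answers acts as the
# pseudo-ground-truth; agreement earns reward 1, disagreement earns 0.
from collections import Counter

def self_reward(samples: list[str]) -> list[float]:
    # samples: final answers parsed from k sampled completions for one prompt
    majority, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]

answers = ["42", "42", "41", "42"]      # e.g. parsed from 4 sampled CoTs
print(self_reward(answers))             # [1.0, 1.0, 0.0, 1.0]
# These rewards would then feed a standard RL update (e.g. PPO/GRPO).
```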
Our work on the surprising negative effects of pre-training LMs for longer has received awards from both workshops (ICBINB and SCOPE) and has been accepted to ICML 2025! One of my first papers as a senior author :) arxiv.org/abs/2503.19206
5⃣ Overtraining (ICBINB oral and SCOPE oral; led by @jacspringer): training your LM for longer makes it harder to fine-tune. Experiments span various model scales and downstream tasks, and theory characterizes this phenomenon in a simple transfer-learning setting. arxiv.org/abs/2503.19206
#ICLR2025 #SCOPEICLR2025 That's a wrap from @iclr_conf, the first #ICLRatAsia, and admittedly one of the most engaging conferences I've ever attended. 📌 Happy to see the overwhelming response at the #SCOPEICLR25 workshop! Thanks to the in-person co-runners @ayazdanb @Shiwei_Liu66! and congrats…
Came across this interesting research at @iclr_conf, which finds that pre-training beyond a certain scale (# of tokens) can hurt fine-tuning performance. Had a great discussion with @jacspringer about it.
Excited to present our recent findings on "catastrophic overtraining" where more pre-training data (shockingly) can lead to worse downstream models.
🚨 Join us at the Workshop on Spurious Correlation & Shortcut Learning (SCSL) at #ICLR2025! @iclr_conf 🗓️ April 28, 2025 📍 Garnet 214-215, Singapore EXPO 🌐 More info: scslworkshop.github.io #ICLR2025
Come learn about how excessive pre-training can make your LLM harder to fine-tune at two #ICLR2025 workshops tomorrow (Monday). I'll be giving talks at:
- ICBINB (10:30am, Hall 4 #1)
- SCOPE (2:30pm, Peridot 204-205)
Plus I'll be attending the poster sessions to chat!
Come check out our poster: Repetition Improves Language Model Embeddings at #ICLR2025! We convert autoregressive language models into high-quality text embedders (optionally without additional training) just by repeating the input. Saturday April 26: 10am-12:30pm, poster #176
Autoregressive language models (LLaMA, Mistral, etc.) are fundamentally limited for text embeddings since they don’t encode information bidirectionally. We provide an easy fix: just repeat your input! We are the #1 fully-open-source model on MTEB! arxiv.org/abs/2402.15449 1/6
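A minimal sketch of the repetition trick, with an assumed prompt template and last-n-token pooling that only approximate the paper's recipe: feed the input twice and pool hidden states over the second occurrence, whose tokens can attend to the entire sentence.

```python
# Illustrative "echo" embedding: repeat the input and mean-pool the hidden
# states of the second occurrence. Any decoder-only LM works in principle;
# the prompt wording and pooling window here are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16).eval()

def echo_embed(text: str) -> torch.Tensor:
    prompt = f"Rewrite the sentence: {text}\nRewritten sentence: {text}"
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, dim)
    # Pool over (roughly) the tokens of the second occurrence of `text`.
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    return hidden[-n:].mean(dim=0)

e1, e2 = echo_embed("A happy dog"), echo_embed("A joyful puppy")
print(torch.nn.functional.cosine_similarity(e1, e2, dim=0))
```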
I won't be at ICLR this year, but I'm sharing my papers here, including an oral in the main conference and several orals in the workshops. Topics include: distillation, data selection for pre-training, a benchmark + provable algorithm for unlearning, preference learning, and overtraining.
Excited to be at #ICLR2025 🇸🇬 to talk about my recent works 👇 that uncover key pitfalls & inefficiencies in pretraining & inference 🚨. Final PhD lap: thinking a lot about how pretraining interventions can shape downstream behaviors (like reasoning & safety). DM to chat or vibe!
Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N
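A minimal sketch of the weight-ensembling step, assuming WiSE-FT-style linear interpolation between a base checkpoint and its fine-tuned counterpart; the model names are placeholders, not the paper's exact checkpoints.

```python
# Illustrative WiSE-FT-style merge: interpolate every parameter between the
# base model and the fine-tuned model, then decode from the merged weights.
from transformers import AutoModelForCausalLM

def wise_ft(base_name: str, ft_name: str, alpha: float = 0.5):
    base = AutoModelForCausalLM.from_pretrained(base_name)
    ft = AutoModelForCausalLM.from_pretrained(ft_name)
    base_sd, ft_sd = base.state_dict(), ft.state_dict()
    merged = {k: (1 - alpha) * base_sd[k] + alpha * ft_sd[k] for k in ft_sd}
    ft.load_state_dict(merged)
    return ft                      # use this interpolated model at test time

# Placeholder names; swap in a real base/fine-tuned checkpoint pair.
# model = wise_ft("base-model", "reasoning-finetuned-model", alpha=0.7)
```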
Looking beyond the next token: TRELAWNEY inserts future tokens <T>...</T> during training to teach models to plan ahead, boosting reasoning, coherence, and control. Highlights:
- NO ARCHITECTURE CHANGES. JUST SMARTER DATA.
- works with standard decoding
- enables controllable…
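A minimal sketch of the data-side idea as I understand it (the split positions, span length, and <T>...</T> placement below are illustrative, not the paper's exact transform): splice a snippet from later in the sequence into an earlier position so a standard next-token model also learns to predict its own future.

```python
# Illustrative future-token insertion: copy a span from later in the sequence
# into an earlier position, wrapped in <T>...</T> delimiters.
import random

def insert_future(tokens: list[str], span: int = 3) -> list[str]:
    if len(tokens) <= span + 2:
        return tokens
    cut = random.randrange(1, len(tokens) - span)       # where to insert hint
    start = random.randrange(cut, len(tokens) - span)   # future span to preview
    future = tokens[start:start + span]
    return tokens[:cut] + ["<T>"] + future + ["</T>"] + tokens[cut:]

words = "the knight moves to e5 then castles kingside".split()
print(" ".join(insert_future(words)))
# Standard decoding still works; at inference, writing goal tokens inside
# <T>...</T> can steer generation toward that future.
```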
trained a nanoGPT? feeling behind before o4-mini? 🚨🚨i'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨 it contains thousands of lines of from-scratch, annotated pytorch implementing advanced…
A very relevant problem these days: catastrophic overtraining.