Adam Zweiger
@AdamZweiger
Rethinking how language models learn | Researcher @MIT_CSAIL
Excited to share our new work on Self-Adapting Language Models! This is my first first-author paper and I’m grateful to be able to work with such an amazing team of collaborators: @jyo_pari @HanGuo97 @akyurekekin @yoonrkim @pulkitology
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
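To make the loop concrete, here is a minimal sketch of the setup the tweet describes. Everything here is a hypothetical placeholder (generate_self_edit, apply_edit, and rl_update are not the paper's API); the point is just the shape: the self-edit is the RL action, and the updated model's downstream score is the reward.

```python
from typing import Callable

# Minimal sketch of the SEAL loop described above. All components are
# hypothetical stand-ins: in practice generate_self_edit samples from the
# LLM itself, apply_edit is a gradient-based finetune on the self-edit,
# and rl_update is a policy-gradient step on the self-edit policy.

def seal_step(model,
              task_context: str,
              eval_downstream: Callable,     # scores a model on held-out queries
              generate_self_edit: Callable,  # model writes its own training data
              apply_edit: Callable,          # weight update on that data
              rl_update: Callable):          # reinforces good self-edits
    self_edit = generate_self_edit(model, task_context)  # the RL action
    updated = apply_edit(model, self_edit)               # inner-loop update
    reward = eval_downstream(updated)                    # downstream performance
    rl_update(model, self_edit, reward)                  # outer-loop RL
    return updated
```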
Models that know what they know are way more useful. So, jointly reward correctness and calibration!
🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --…
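The tweet doesn't spell out the reward, but one natural way to jointly reward correctness and calibration is correctness minus a Brier penalty on the model's stated confidence. A sketch of that formulation (mine, which may differ from the paper's exact reward):

```python
# Correctness minus a Brier-score penalty on stated confidence: the reward
# is maximized by being right AND saying so with high confidence.
# (A sketch; RLCR's exact reward may differ.)

def reward(correct: bool, confidence: float) -> float:
    """correct: whether the answer was right; confidence: the model's
    stated probability in [0, 1] that its answer is right."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# Right and confident beats right but hedging:
assert reward(True, 0.9) > reward(True, 0.5)
# An overconfident wrong answer is punished hardest:
assert reward(False, 0.95) < reward(False, 0.5)
```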
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:
- Transformer-PSM by @MorrisYau et al., and
- Radial Attention by Xingyang and @lmxyy1999 et…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention, with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
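A toy version of how something can sit between linear and full attention (my reading of the headline claims, not the paper's algorithm or its Triton kernels): partition the past into O(log t) power-of-two buckets, compress each bucket to one summary, and attend over just those.

```python
import numpy as np

# Toy illustration only: real log-linear attention keeps richer per-bucket
# states and fused kernels; this just shows the log-many-summaries idea.

def fenwick_buckets(t: int):
    """Split positions [0, t) into O(log t) geometrically sized segments."""
    buckets, start = [], 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            buckets.append((start, start + size))
            start += size
    return buckets

def log_linear_attn_step(q, K, V):
    """One query attends to log-many bucket summaries, not all t positions."""
    ks, vs = [], []
    for lo, hi in fenwick_buckets(len(K)):
        ks.append(K[lo:hi].mean(0))  # mean-pooled summary per bucket
        vs.append(V[lo:hi].mean(0))
    s = np.stack(ks) @ q
    w = np.exp(s - s.max()); w /= w.sum()  # softmax over O(log t) buckets
    return w @ np.stack(vs)

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(13, 8)), rng.normal(size=(13, 8)), rng.normal(size=8)
print(log_linear_attn_step(q, K, V))  # 13 positions -> only 3 buckets (8+4+1)
```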
This is crazy! It makes more sense once you hear that it requires both models to have the same initialization. If you could get a method like this to work without that, it would have big implications for data poisoning. I don't think it's possible, but someone should look into it more.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
the whole shrimp welfare thing is actually a great reductio ad absurdum for ea/rat. I suggest that if you're interested in that stuff, you focus your efforts on high-impact things more aligned with normal human intuitions about morality.
Okay let's clarify some things. Link below
interesting
A striking thing about OpenAI's IMO gold math model is how terse it is: it really tries to express itself in single tokens, often breaking the rules of grammar and spelling to do so. They say compression is intelligence. We may be seeing a totally novel way to do compression…
A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?
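The 100M+ figure holds up as a back-of-envelope estimate (my arithmetic and my throughput guess, not the tweet's):

```python
# A decade of daily engagement at a guessed ~30k tokens/day of reading,
# writing, and internal monologue lands right around the tweet's figure.
days = 10 * 365
tokens_per_day = 30_000  # assumption, not from the tweet
print(f"{days * tokens_per_day:,}")  # 109,500,000 -> ~100M+
```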
why are all the quant booths at ICML on absolute steroids?? @OpenAI gotta build one of these things next time
This is one of the highest-quality evals I've seen and it's nice to see it expanding! I love how you can view each model-problem-run datapoint
We are launching Project Euler on MathArena to track the performance of LLMs on challenging new problems at the intersection of mathematics and programming, which are published every week on the Project Euler website 🧵(1/6)
Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles. I will be presenting with @jyo_pari at E-2702, Tuesday 11-1:30!
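The core recipe, in a hypothetical sketch (placeholder functions, not the paper's code): briefly finetune on the task's own demonstrations, then keep those same demonstrations in context at inference.

```python
from typing import Callable, List, Tuple

# Sketch of combining test-time training with in-context learning for a
# novel few-shot task (e.g. an ARC-AGI puzzle). finetune, format_prompt,
# and generate are hypothetical placeholders.

def ttt_plus_icl(model,
                 demos: List[Tuple[str, str]],  # the task's input/output pairs
                 test_input: str,
                 finetune: Callable, format_prompt: Callable, generate: Callable):
    # 1. Test-time training: a few gradient steps on the demos themselves
    #    (in practice on augmented copies, via a lightweight LoRA-style update).
    adapted = finetune(model, demos, steps=16)
    # 2. In-context learning: the same demos stay in the prompt.
    prompt = format_prompt(demos, test_input)
    return generate(adapted, prompt)
```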
MoE Routers are trained a bit strangely but things seem to still work. @minyoung_huh and I got curious about combining specialized experts at test time through routing… and ended up deep in the weeds of MoE optimization. Here's a blog post! jyopari.github.io/posts/peculiar…
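For context on what the router is doing, here is a minimal top-k softmax router in the standard formulation (the blog post digs into how oddly its training behaves); gates like these are also what you'd reuse to mix specialized experts at test time.

```python
import numpy as np

# Minimal top-k softmax MoE router: score all experts, keep the best k,
# and return the gate-weighted mix of their outputs.

def route(x, W_router, experts, k=2):
    logits = W_router @ x                         # one logit per expert
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over the selected k
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(4, 4)): W @ x for _ in range(8)]
x, W_router = rng.normal(size=4), rng.normal(size=(8, 4))
print(route(x, W_router, experts))                # gate-weighted mix of 2 experts
```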
This will be the biggest architecture change for language modeling since the transformer.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
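To illustrate the dynamic chunking idea at a toy level (purely my sketch of the concept, not the H-Net architecture): a learned scorer marks boundaries in the raw byte stream, and the main model then sees one pooled vector per chunk.

```python
import numpy as np

# Toy illustration: chunk boundaries are chosen by the model itself rather
# than by a fixed external tokenizer.

def dynamic_chunk(byte_embs, boundary_prob, threshold=0.5):
    """byte_embs: (T, d) embeddings of raw bytes; boundary_prob maps an
    embedding to the probability that this byte ends a chunk."""
    chunks, current = [], []
    for e in byte_embs:
        current.append(e)
        if boundary_prob(e) > threshold:             # model-chosen boundary
            chunks.append(np.mean(current, axis=0))  # pool chunk -> one vector
            current = []
    if current:
        chunks.append(np.mean(current, axis=0))
    return np.stack(chunks)  # (num_chunks, d), far fewer rows than T in practice

rng = np.random.default_rng(0)
embs = rng.normal(size=(32, 16))
w = rng.normal(size=16)                              # stand-in scorer weights
prob = lambda e: 1.0 / (1.0 + np.exp(-(e @ w)))
print(dynamic_chunk(embs, prob).shape)               # fewer than 32 chunks
```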
Very excited to see what they've cooked up now. My out-there guess: use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but with far fewer tokens and better expressivity for some domains.
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
re @slatestarcodex on the human analogue of LLM hallucination: There is simply no human equivalent to what GPT-4 does when asked "What does GRPO stand for" or even "What is the capital of France" with nothing else in the prompt. The closest thing is someone…
Third #ICML2025 paper! What effect will web-scale synthetic data have on future deep generative models? Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/7
Reality doesn't match "exponential self-improvement" thought experiments. Still crucial to have safety, but it's increasingly clear that there will be no fast takeoff. Even models generating their own training data as in SEAL will improve gradually and under human oversight.
We don't have AI self-improvement yet, and when we do it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a "fast takeoff", but rather extremely gradual across many years, probably a decade. The first thing to know is that…
super bullish on meta, just really hope they keep open-weighting and do something like the llama 3 report again…
The Information reports that Meta has hired four more OpenAI researchers: Shengjia Zhao (a key contributor to the o1-mini and o3-mini reasoning models and a former Stanford doctoral candidate), Jiahui Yu (who led OpenAI's perception team), Shuchao Bi (OpenAI's head of multimodal…
If any undergrads follow me: apply for HackMIT!
ONE WEEK LEFT until priority applications for HackMIT 2025 close!! If you haven’t yet applied, let’s recap what HackMIT has to offer:🧵 #HackMIT #Hackathon #mit
Prediction: In 2030, >90% of training tokens for frontier models will be LLM-generated. Models are only going to get better at synthesizing new information with what's already in context. With either SEAL or better heuristics, synthetic-data quality will improve even further.
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689