Adam Zweiger
@AdamZweiger
Rethinking how language models learn | Researcher @MIT_CSAIL
Excited to share our new work on Self-Adapting Language Models! This is my first first-author paper and I’m grateful to be able to work with such an amazing team of collaborators: @jyo_pari @HanGuo97 @akyurekekin @yoonrkim @pulkitology
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
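To make the loop concrete, here is a minimal sketch of the setup the tweet describes. Everything here is a hypothetical placeholder (generate_self_edit, apply_edit, and rl_update are not the paper's API); the point is just the shape: the self-edit is the RL action, and the updated model's downstream score is the reward.

```python
from typing import Callable

# Minimal sketch of the SEAL loop described above. All components are
# hypothetical stand-ins: in practice generate_self_edit samples from the
# LLM itself, apply_edit is a gradient-based finetune on the self-edit,
# and rl_update is a policy-gradient step on the self-edit policy.

def seal_step(model,
              task_context: str,
              eval_downstream: Callable,     # scores a model on held-out queries
              generate_self_edit: Callable,  # model writes its own training data
              apply_edit: Callable,          # weight update on that data
              rl_update: Callable):          # reinforces good self-edits
    self_edit = generate_self_edit(model, task_context)  # the RL action
    updated = apply_edit(model, self_edit)               # inner-loop update
    reward = eval_downstream(updated)                    # downstream performance
    rl_update(model, self_edit, reward)                  # outer-loop RL
    return updated
```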
Models that know what they know are way more useful. So, jointly reward correctness and calibration!
🚨New Paper!🚨 We trained reasoning LLMs to reason about what they don't know. o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more. Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --…
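The tweet doesn't spell out the reward, but one natural way to jointly reward correctness and calibration is correctness minus a Brier penalty on the model's stated confidence. A sketch of that formulation (mine, which may differ from the paper's exact reward):

```python
# Correctness minus a Brier-score penalty on stated confidence: the reward
# is maximized by being right AND saying so with high confidence.
# (A sketch; RLCR's exact reward may differ.)

def reward(correct: bool, confidence: float) -> float:
    """correct: whether the answer was right; confidence: the model's
    stated probability in [0, 1] that its answer is right."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# Right and confident beats right but hedging:
assert reward(True, 0.9) > reward(True, 0.5)
# An overconfident wrong answer is punished hardest:
assert reward(False, 0.95) < reward(False, 0.5)
```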
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:
- Transformer-PSM by @MorrisYau et al., and
- Radial Attention by Xingyang and @lmxyy1999 et…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention, with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
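A toy version of how something can sit between linear and full attention (my reading of the headline claims, not the paper's algorithm or its Triton kernels): partition the past into O(log t) power-of-two buckets, compress each bucket to one summary, and attend over just those.

```python
import numpy as np

# Toy illustration only: real log-linear attention keeps richer per-bucket
# states and fused kernels; this just shows the log-many-summaries idea.

def fenwick_buckets(t: int):
    """Split positions [0, t) into O(log t) geometrically sized segments."""
    buckets, start = [], 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            buckets.append((start, start + size))
            start += size
    return buckets

def log_linear_attn_step(q, K, V):
    """One query attends to log-many bucket summaries, not all t positions."""
    ks, vs = [], []
    for lo, hi in fenwick_buckets(len(K)):
        ks.append(K[lo:hi].mean(0))  # mean-pooled summary per bucket
        vs.append(V[lo:hi].mean(0))
    s = np.stack(ks) @ q
    w = np.exp(s - s.max()); w /= w.sum()  # softmax over O(log t) buckets
    return w @ np.stack(vs)

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(13, 8)), rng.normal(size=(13, 8)), rng.normal(size=8)
print(log_linear_attn_step(q, K, V))  # 13 positions -> only 3 buckets (8+4+1)
```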
This is crazy! It makes more sense once you hear that it requires both models to have the same initialization. If you could get a method like this to work without that, it would have big implications for data poisoning. I don't think it's possible, but someone should look into it more.
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
the whole shrimp welfare thing is actually a great reductio ad absurdum for ea/rat. I suggest that if you're interested in that stuff, you focus your efforts on high-impact things more aligned with normal human intuitions about morality.
Okay let's clarify some things. Link below
interesting
A striking thing about OpenAI's IMO gold math model is how terse it is: it really tries to express itself in single tokens, often breaking the rules of grammar and spelling to do so. They say compression is intelligence. We may be seeing a totally novel way to do compression…
A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?
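The 100M+ figure holds up as a back-of-envelope estimate (my arithmetic and my throughput guess, not the tweet's):

```python
# A decade of daily engagement at a guessed ~30k tokens/day of reading,
# writing, and internal monologue lands right around the tweet's figure.
days = 10 * 365
tokens_per_day = 30_000  # assumption, not from the tweet
print(f"{days * tokens_per_day:,}")  # 109,500,000 -> ~100M+
```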
why are all the quant booths at ICML on absolute steroids?? @OpenAI gotta build one of these things next time
This is one of the highest-quality evals I've seen and it's nice to see it expanding! I love how you can view each model-problem-run datapoint
We are launching Project Euler on MathArena to track the performance of LLMs on challenging new problems at the intersection of mathematics and programming, which are published every week on the Project Euler website 🧵(1/6)
Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles. I will be presenting with @jyo_pari at E-2702, Tuesday 11-1:30!
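The core recipe, in a hypothetical sketch (placeholder functions, not the paper's code): briefly finetune on the task's own demonstrations, then keep those same demonstrations in context at inference.

```python
from typing import Callable, List, Tuple

# Sketch of combining test-time training with in-context learning for a
# novel few-shot task (e.g. an ARC-AGI puzzle). finetune, format_prompt,
# and generate are hypothetical placeholders.

def ttt_plus_icl(model,
                 demos: List[Tuple[str, str]],  # the task's input/output pairs
                 test_input: str,
                 finetune: Callable, format_prompt: Callable, generate: Callable):
    # 1. Test-time training: a few gradient steps on the demos themselves
    #    (in practice on augmented copies, via a lightweight LoRA-style update).
    adapted = finetune(model, demos, steps=16)
    # 2. In-context learning: the same demos stay in the prompt.
    prompt = format_prompt(demos, test_input)
    return generate(adapted, prompt)
```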
MoE Routers are trained a bit strangely but things seem to still work. @minyoung_huh and I got curious about combining specialized experts at test time through routing… and ended up deep in the weeds of MoE optimization. Here's a blog post! jyopari.github.io/posts/peculiar…
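For context on what the router is doing, here is a minimal top-k softmax router in the standard formulation (the blog post digs into how oddly its training behaves); gates like these are also what you'd reuse to mix specialized experts at test time.

```python
import numpy as np

# Minimal top-k softmax MoE router: score all experts, keep the best k,
# and return the gate-weighted mix of their outputs.

def route(x, W_router, experts, k=2):
    logits = W_router @ x                         # one logit per expert
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over the selected k
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(4, 4)): W @ x for _ in range(8)]
x, W_router = rng.normal(size=4), rng.normal(size=(8, 4))
print(route(x, W_router, experts))                # gate-weighted mix of 2 experts
```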
This will be the biggest architecture change for language modeling since the transformer.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
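To illustrate the dynamic chunking idea at a toy level (purely my sketch of the concept, not the H-Net architecture): a learned scorer marks boundaries in the raw byte stream, and the main model then sees one pooled vector per chunk.

```python
import numpy as np

# Toy illustration: chunk boundaries are chosen by the model itself rather
# than by a fixed external tokenizer.

def dynamic_chunk(byte_embs, boundary_prob, threshold=0.5):
    """byte_embs: (T, d) embeddings of raw bytes; boundary_prob maps an
    embedding to the probability that this byte ends a chunk."""
    chunks, current = [], []
    for e in byte_embs:
        current.append(e)
        if boundary_prob(e) > threshold:             # model-chosen boundary
            chunks.append(np.mean(current, axis=0))  # pool chunk -> one vector
            current = []
    if current:
        chunks.append(np.mean(current, axis=0))
    return np.stack(chunks)  # (num_chunks, d), far fewer rows than T in practice

rng = np.random.default_rng(0)
embs = rng.normal(size=(32, 16))
w = rng.normal(size=16)                              # stand-in scorer weights
prob = lambda e: 1.0 / (1.0 + np.exp(-(e @ w)))
print(dynamic_chunk(embs, prob).shape)               # fewer than 32 chunks
```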
Very excited to see what they've cooked up now. My out-there guess: use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but with far fewer tokens and better expressivity for some domains.
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
re @slatestarcodex on the human analogue of LLM hallucination: There is simply no human equivalent to what GPT-4 does when asked "What does GRPO stand for" or even "What is the capital of France" with nothing else in the prompt. The closest thing is someone…
Third #ICML2025 paper! What effect will web-scale synthetic data have on future deep generative models? Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/7
Reality doesn't match "exponential self-improvement" thought experiments. Still crucial to have safety, but it's increasingly clear that there will be no fast takeoff. Even models generating their own training data as in SEAL will improve gradually and under human oversight.
We don't have AI self-improvement yet, and when we do it will be a game-changer. With more wisdom now compared to the GPT-4 days, it's obvious that it will not be a "fast takeoff", but rather extremely gradual across many years, probably a decade. The first thing to know is that…
super bullish on meta, just really hope they keep open-weighting and do something like the llama 3 report again…
The Information reports that Meta has hired four more OpenAI researchers: Shengjia Zhao (a key contributor to the o1-mini and o3-mini reasoning models and a former Stanford doctoral candidate), Jiahui Yu (who led OpenAI's perception team), Shuchao Bi (OpenAI's head of multimodal…
If any undergrads follow me: apply for HackMIT!
ONE WEEK LEFT until priority applications for HackMIT 2025 close!! If you haven’t yet applied, let’s recap what HackMIT has to offer:🧵 #HackMIT #Hackathon #mit
Prediction: In 2030, >90% of training tokens for frontier models will be LLM-generated. Models are only going to get better at synthesizing new information with what's already in context. With either SEAL or better heuristics, synthetic-data quality will improve even further.
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689