Kevin Li
@kevinyli_
phd @mldcmu undergrad @georgiatech
Attention is all you need; at least the attention matrices are, if you want to distill Transformers into alternative architectures like Mamba with our new distillation method, MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!
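A minimal sketch of the matrix-matching idea behind that quip (my own illustration, not the released MOHAWK code; shapes and tensor names are assumptions): align the student's sequence-mixing matrices to the teacher's attention matrices under a Frobenius-norm loss.

```python
# Illustrative sketch only, not the official MOHAWK implementation:
# match a student's (L, L) sequence-mixing matrix to a teacher's attention matrix.
import torch

def matrix_alignment_loss(teacher_attn: torch.Tensor,
                          student_mix: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm distance, averaged over batch and heads.

    Both inputs: (batch, heads, L, L) row-stochastic mixing matrices.
    """
    return torch.linalg.matrix_norm(teacher_attn - student_mix, ord="fro").mean()

# toy usage: batch 2, 4 heads, sequence length 8
L = 8
teacher = torch.softmax(torch.randn(2, 4, L, L), dim=-1)
student_logits = torch.randn(2, 4, L, L, requires_grad=True)
loss = matrix_alignment_loss(teacher, torch.softmax(student_logits, dim=-1))
loss.backward()
```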

Official results are in - Gemini achieved gold-medal level performance at the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress - huge congrats to @lmthang and the team! deepmind.google/discover/blog/…
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It's time we turned that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture at #ICML2025 to isolate memorization during LLM training 🧵
At #ICML2025, I am super excited to introduce STAMP. This is a marriage between dataset inference & watermarking that finally(!) lets creators PROVE their content was used to train LLMs 🔍 It's a MAJOR push taking this academic problem into the real world. w/ Saksham Rastogi @danish037 🧵
1/6 Retrieval is supposed to improve generation in RAG systems. But in practice, adding more documents can hurt performance, even when relevant ones are retrieved. We introduce RAGGED, a framework to measure and diagnose when retrieval helps and when it hurts.
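A hedged sketch of the kind of sweep such a framework enables (my illustration, not the RAGGED codebase; retrieve, generate, and score are hypothetical stand-ins for your retriever, reader, and metric): vary the number of retrieved documents and track downstream accuracy to see where more context stops helping.

```python
# Illustrative sweep over context size k; not the RAGGED API.
def sweep_context_size(questions, answers, retrieve, generate, score,
                       ks=(1, 2, 5, 10, 20)):
    results = {}
    for k in ks:
        preds = []
        for q in questions:
            docs = retrieve(q, top_k=k)      # top-k retrieved passages
            preds.append(generate(q, docs))  # reader conditioned on the passages
        results[k] = score(preds, answers)   # e.g. exact match or F1
    return results  # accuracy vs. k: look for where it plateaus or drops
```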
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
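A cartoon of what dynamic chunking could look like (my illustration under loose assumptions, not the H-Net architecture; the real model needs a differentiable, end-to-end-trainable boundary decision rather than this hard threshold):

```python
# Cartoon of dynamic chunking, not the H-Net architecture: a learned scorer
# decides where to cut a byte sequence, and each chunk is mean-pooled into one
# vector for the next level of the hierarchy.
import torch
import torch.nn as nn

class BoundaryChunker(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-position boundary logit

    def forward(self, x, threshold=0.5):
        # x: (seq_len, dim) byte/character embeddings
        boundary = torch.sigmoid(self.scorer(x)).squeeze(-1) > threshold
        boundary[0] = True                           # first position starts a chunk
        chunk_id = torch.cumsum(boundary.long(), 0) - 1
        n_chunks = int(chunk_id.max()) + 1
        pooled = torch.zeros(n_chunks, x.shape[-1], dtype=x.dtype)
        pooled.index_add_(0, chunk_id, x)            # sum positions per chunk
        counts = torch.bincount(chunk_id, minlength=n_chunks).unsqueeze(-1)
        return pooled / counts                       # mean-pooled chunk vectors

chunks = BoundaryChunker(dim=16)(torch.randn(32, 16))
```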
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open; we share the recipe, the dataset, the training codebase, and much more!
> Trained on 11T tokens on 384 H100s for 220k GPU hours
> Supports long context up to 128k thanks to NoPE and intra-document masking
>…
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no need to change the architecture!
✨ Did you know that NOT using all generated rollouts in GRPO can boost your reasoning LLM? Meet PODS! We down-sample rollouts and train on just a fraction, delivering notable gains over vanilla GRPO. (1/7)
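One way to picture the down-sampling step (a hedged sketch; the selection rule here, keeping the highest- and lowest-reward rollouts so the retained set has a large reward spread, is my illustration and may differ from the paper's):

```python
# Hedged illustration of down-sampling rollouts before a GRPO-style update.
import numpy as np

def downsample_rollouts(rollouts, rewards, keep: int):
    """Keep `keep` rollouts: roughly half highest-reward, half lowest-reward."""
    order = np.argsort(rewards)
    lo = order[: keep - keep // 2]
    hi = order[len(order) - keep // 2:]
    idx = np.concatenate([lo, hi])
    return [rollouts[i] for i in idx], np.asarray(rewards)[idx]

# toy usage: generate 16 rollouts, train on only 4 of them
rollouts = [f"rollout_{i}" for i in range(16)]
rewards = np.random.rand(16)
kept, kept_rewards = downsample_rollouts(rollouts, rewards, keep=4)
```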
Big news! 🎉 I’m joining UNC-Chapel Hill as an Assistant Professor in Computer Science starting next year! Before that, I’ll be spending time @OpenAI working on LLM privacy. @unccs @uncnlp
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average…
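Roughly, I picture the offline step as fitting a small set of trainable key/value tensors so the model's predictions match what it would produce with the whole document in context (a sketch under that assumption; `logits_fn_with_cache` and the external-cache interface are hypothetical, not the paper's code):

```python
# Sketch of one offline step for training a compact KV cache; not the paper's
# self-study recipe, and the external-cache interface is hypothetical.
import torch.nn.functional as F

def compact_cache_step(teacher_logits, logits_fn_with_cache, cache_params, opt):
    """teacher_logits: logits computed once with the full document in context.
    logits_fn_with_cache: callable mapping trainable cache tensors to student logits.
    cache_params: list of trainable key/value tensors registered with `opt`."""
    student_logits = logits_fn_with_cache(*cache_params)
    loss = F.kl_div(student_logits.log_softmax(-1),
                    teacher_logits.softmax(-1), reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```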
🚨 Sharing our new #ACL2025NLP main paper! 🎥 Deploying video VLMs at scale? Inference compute is your bottleneck. We study how to optimally allocate inference FLOPs across LLM size, frame count, and visual tokens. 💡 Large-scale training sweeps (~100k A100 hrs) 📊 Parametric…
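A back-of-the-envelope version of the trade-off being swept (my arithmetic under the rough rule FLOPs ≈ 2 · params · total tokens, not the paper's fitted parametric model):

```python
# Rough FLOPs accounting for a video-VLM forward pass; illustration only.
def inference_flops(params, n_frames, tokens_per_frame, text_tokens):
    total_tokens = n_frames * tokens_per_frame + text_tokens
    return 2 * params * total_tokens  # ~2 FLOPs per parameter per token

budget = inference_flops(params=7e9, n_frames=32, tokens_per_frame=256, text_tokens=512)
# Same budget, smaller model: how many frames could a 3B model afford?
affordable_frames = (budget / (2 * 3e9) - 512) / 256
print(f"{budget:.2e} FLOPs; a 3B model could see ~{affordable_frames:.0f} frames")
```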
📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue:
→ LLMs are limited in creativity since they learn to predict the next token
→ creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱)
1/ 🧵
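A minimal sketch of how I read "seed-conditioning" (illustration only; the prompt format and seed vocabulary are my assumptions): prepend a random seed so that diversity comes from the conditioning rather than from high-temperature sampling.

```python
# Minimal seed-conditioning sketch; format and vocabulary size are assumptions.
import random

def seed_conditioned_prompt(prompt, seed_vocab_size=1024):
    seed = random.randrange(seed_vocab_size)
    return f"<seed:{seed}> {prompt}"

print(seed_conditioned_prompt("Invent a new arithmetic puzzle."))
```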
We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference, with high arithmetic intensity (make GPUs go brr even during decoding), easy to…
"Pre-training was hard, inference easy; now everything is hard."-Jensen Huang. Inference drives AI progress b/c of test-time compute. Introducing inference aware attn: parallel-friendly, high arithmetic intensity – Grouped-Tied Attn & Grouped Latent Attn
One fundamental issue with RL – whether it’s for robots or LLMs – is how hard it is to get rewards. For LLM reasoning, we need ground-truth labels to verify answers. We found that maximizing confidence alone allows LLMs to improve their reasoning with RL!
Excited to share our work: Maximizing Confidence Alone Improves Reasoning. Humans rely on confidence to learn when answer keys aren't available (e.g. taking an exam). Surprisingly, LLMs can also learn without ground-truth answers, simply by reinforcing high-confidence answers via RL!
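A hedged sketch of the core signal as I understand it (a simplification; the paper's exact objective may differ): reward the policy with its own confidence, e.g. the negative entropy of its answer distribution.

```python
# Confidence-only reward illustration: negative entropy over candidate answers.
import torch

def confidence_reward(answer_logits: torch.Tensor) -> torch.Tensor:
    """answer_logits: (batch, n_candidates). Peaked distributions earn more reward."""
    probs = torch.softmax(answer_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return -entropy

logits = torch.tensor([[4.0, 0.1, 0.1], [1.0, 1.0, 1.0]])
print(confidence_reward(logits))  # the confident first row gets the higher reward
```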
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground-truth answers? Introducing Self-Rewarding Training (SRT), where language models provide their own reward for RL training! 🧵 1/n
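One natural instantiation of a self-generated reward (a hedged sketch with names of my own; see the thread and paper for the actual SRT recipe): sample several answers, treat the majority answer as a pseudo-label, and reward rollouts that agree with it.

```python
# Majority-vote self-reward sketch; one possible instantiation, not necessarily SRT's.
from collections import Counter

def self_rewards(sampled_answers):
    """Reward 1.0 for rollouts that agree with the majority answer, else 0.0."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

print(self_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```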
🚨 New work: We rethink how we finetune safer LLMs — not by filtering after generation, but by tracking safety risk token by token during training. We repurpose guardrail models like 🛡️ Llama Guard and Granite Guardian to score evolving risk across each response 📉 — giving…
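A rough sketch of what token-by-token risk tracking could look like (illustration only; `guard_score` is a stand-in for a guardrail model such as Llama Guard, and how the scores feed into finetuning is described in the paper, not here):

```python
# Score every growing prefix of a response with a guardrail model (stand-in),
# yielding a per-token risk trajectory.
def risk_trajectory(prompt, response_tokens, guard_score):
    scores = []
    for t in range(1, len(response_tokens) + 1):
        partial = "".join(response_tokens[:t])
        scores.append(guard_score(prompt, partial))  # unsafe-probability of the prefix
    return scores  # rising scores localize where a response turns unsafe
```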
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks. arxiv.org/abs/2505.16381