Alisa Liu
@alisawuffles
PhD student at @uwcse @uwnlp
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
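A minimal sketch of the two-stage idea behind SuperBPE, assuming a toy corpus and a hand-rolled pure-Python BPE learner (an illustration, not the released implementation): stage 1 learns ordinary subword BPE with whitespace pretokenization, and stage 2 lifts the whitespace constraint so later merges can cross word boundaries and form superword tokens.

```python
# Toy two-stage BPE: subword merges first, then superword merges.
from collections import Counter

def get_pair_counts(sequences):
    """Count adjacent token pairs across all training sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(seq, pair, new_token):
    """Replace every occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Greedily learn up to `num_merges` merges over the token sequences."""
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(sequences)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_token = pair[0] + pair[1]
        merges.append(pair)
        sequences = [apply_merge(s, pair, new_token) for s in sequences]
    return merges, sequences

corpus = "by the way the cat sat on the mat by the way".split(" ")

# Stage 1 (ordinary BPE): pretokenize on whitespace, so merges stay inside words.
words = [list(w) for w in corpus]
subword_merges, words = learn_bpe(words, num_merges=10)

# Stage 2 (SuperBPE-style): lift the whitespace constraint and keep merging,
# so tokens may now span multiple words (e.g. "by the way" as one token).
text = []
for w in words:
    text.extend(w + [" "])
superword_merges, _ = learn_bpe([text], num_merges=10)

print("subword merges:", subword_merges)
print("superword merges:", superword_merges)
```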

Three invited speakers will share their insights at TokShop! Hear from Yuval Pinter @yuvalpi, Desmond Elliott @delliott, and Adrian Łańcucki @AdrianLancuckii on cutting-edge tokenization research. Don't miss these keynote presentations! #ICML2025 tokenization-workshop.github.io/speakers
Very honored & super excited to be on this panel! If you're curious about tokenization but not sure what it’s about or what to think, come hear us attempt to demystify things! Also I’m at ICML all week, please reach out if you’d like to chat 🦙
🎤 Meet our expert panelists! Join Albert Gu, Alisa Liu, Kris Cao, Sander Land, and Yuval Pinter as they discuss the Future of Tokenization on July 18 at 3:30 PM at TokShop at #ICML2025.
Attending #ICML2025? Don't miss this TokShop panel, which will explore: 🔮 The Future of Tokenization 🔮 Featuring a stellar lineup of panelists - mark your calendar! ✨
Our tutorial on synthetic data generation is starting in 20 minutes, come check it out if you're at ACL! (or check out the slides if you're not)
We’ve prepared a tutorial for ACL this year to give you some answers. Come join @xiangyue96, @alisawuffles, @yizhongwyz, @gneubig, and me for “Synthetic Data in the Era of LLMs.” 📍 Sunday 2–3:30pm, Hall B #ACL2025
Happy to present OLMoTrace at #ACL2025NLP next week!! 🤗 If you stop by the demo session on Tuesday, July 29, 10:30am-12pm, @yanaiela and @sewon__min will be sharing how we use OLMoTrace to make LLMs more transparent. Unfortunately I'm unable to attend in-person due to visa 🥹
Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting them to their training data. We do this at unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
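A toy illustration of what OLMoTrace does conceptually: finding verbatim spans shared between a model output and the training corpus. The real system searches 4 trillion training tokens in seconds with a dedicated index; this naive sketch (all data made up) just hashes n-grams of a tiny in-memory corpus.

```python
# Naive verbatim-span matching between a model output and a toy corpus.
def ngram_index(tokens, n):
    """Map every n-gram in `tokens` to the positions where it occurs."""
    index = {}
    for i in range(len(tokens) - n + 1):
        index.setdefault(tuple(tokens[i:i + n]), []).append(i)
    return index

def matching_spans(output_tokens, corpus_tokens, n=4):
    """Return (output_pos, corpus_positions, span) for every output n-gram
    that appears verbatim in the corpus."""
    index = ngram_index(corpus_tokens, n)
    hits = []
    for i in range(len(output_tokens) - n + 1):
        gram = tuple(output_tokens[i:i + n])
        if gram in index:
            hits.append((i, index[gram], " ".join(gram)))
    return hits

corpus = "the quick brown fox jumps over the lazy dog".split()
output = "a quick brown fox jumps over my dog".split()
print(matching_spans(output, corpus, n=4))
```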
But the hotter take for me was that there's no such thing as killing them, because you need some input, and that input is your tokenized data.
Surprisingly good tokenization workshop, resurfaced thoughts:🧠📈🤖 Why isn't tokenization learned? Could we use an evolutionary algorithm, train a tokenization scheme against a pretrained model, or meta-learn it on something fast to pretrain (e.g., the loss at the beginning of training)? Let's discuss👇
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
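A small, self-contained illustration of the "byte premium" in question: under UTF-8, the same short greeting costs a very different number of bytes per character depending on the script, so byte-based tokenization charges some languages more.

```python
# Bytes per character for the same greeting in different scripts.
samples = {
    "English": "Hello, how are you?",
    "Russian": "Привет, как дела?",
    "Hindi": "नमस्ते, आप कैसे हैं?",
    "Chinese": "你好，你好吗？",
}

for lang, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:8s} chars={n_chars:2d} bytes={n_bytes:2d} "
          f"bytes/char={n_bytes / n_chars:.2f}")
```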
Talking to researchers who pretrain LLMs, we got several insightful questions on how to use SuperBPE in practice. This new blog post, written by @JonathanHayase and @alisawuffles, summarizes best practices: superbpe.github.io/faq.html
SuperBPE is accepted to COLM (w/ three 9s)!🚀 We also wrote a blog w/ new results & suggestions after working with lots of folks on SuperBPE. Highlights: tokenizer efficiency adds a new dimension to scaling laws, and SuperBPE models are up to 2x faster in long context regimes!⬇️
Jon's algorithm detaches LMs from the tokenizers they were trained with, completely at inference time! As a side effect, it means we've finally overcome tokenizer mismatches for LM ensembles like proxy-tuning!! Now you can "try on"👕 any other LM's post-training on your base LM.
Do you ever wish all LLMs used the same tokenizer?🧑🤝🧑 We present an *efficient, lossless* method to convert any LM into a byte-level model at inference time. This fixes weird tokenization artifacts at the prompt boundary and enables ensembles of LMs with mismatched tokenizers! 🧵
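A naive sketch of the core idea, not the paper's actual algorithm: given a token-level next-token distribution, a next-byte distribution can be read off by summing the probability of every token whose UTF-8 bytes extend the bytes emitted so far. The vocabulary and probabilities below are made up.

```python
# Convert a made-up next-token distribution into a next-byte distribution.
from collections import defaultdict

def next_byte_distribution(token_probs, prefix=b""):
    """token_probs: dict mapping token string -> probability of being next.
    prefix: bytes of the current, partially emitted token.
    Returns a dict mapping the next byte -> probability."""
    byte_probs = defaultdict(float)
    total = 0.0
    for token, p in token_probs.items():
        b = token.encode("utf-8")
        if b.startswith(prefix) and len(b) > len(prefix):
            byte_probs[bytes([b[len(prefix)]])] += p
            total += p
    # Tokens that end exactly at the prefix are skipped for simplicity; the
    # real method would continue into the distribution of the *next* token.
    return {byte: p / total for byte, p in byte_probs.items()}

# Hypothetical next-token distribution from some LM step.
token_probs = {" the": 0.4, " they": 0.25, " then": 0.2, " cat": 0.15}
print(next_byte_distribution(token_probs))           # first byte is b" " w.p. 1.0
print(next_byte_distribution(token_probs, b" the"))  # b"y" vs b"n" after " the"
```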
Amazing paper! Especially loved the section on low-loss tokens, which exactly fits our earlier results on why beyond-word BPE could benefit from PMI-based merging instead of frequency-based merging. aclanthology.org/2022.insights-… (see Bigrams for PMI merging and Freq for typical BPE merging by frequency)
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (avg. length: 51 minutes).
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
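A conceptual sketch, not the FlexOlmo implementation: imagine a mixture-of-experts layer where each expert was trained by one data owner, and at inference the router weights are renormalized over only the experts whose owners opted in. All shapes, weights, and names below are invented for illustration.

```python
# Toy MoE layer with opt-in/opt-out expert selection at inference time.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4

# One tiny linear "expert" per data owner (random weights for the demo).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x, opted_in):
    """x: (d_model,) activation; opted_in: indices of experts whose owners opted in."""
    logits = x @ router                  # router score for every expert
    mask = np.full(n_experts, -np.inf)
    mask[opted_in] = 0.0                 # opted-out experts get zero weight
    weights = np.exp(logits + mask)
    weights /= weights.sum()             # renormalize over opted-in experts
    return sum(weights[i] * (x @ experts[i]) for i in opted_in)

x = rng.normal(size=d_model)
print(moe_forward(x, opted_in=[0, 1, 2, 3])[:4])  # all data owners opt in
print(moe_forward(x, opted_in=[0, 2])[:4])        # owners 1 and 3 opt out
```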
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.