Alisa Liu
@alisawuffles
PhD student at @uwcse @uwnlp
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
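A minimal sketch of the two-stage idea behind SuperBPE, assuming a toy corpus and a hand-rolled pure-Python BPE learner (an illustration, not the released implementation): stage 1 learns ordinary subword BPE with whitespace pretokenization, and stage 2 lifts the whitespace constraint so later merges can cross word boundaries and form superword tokens.

```python
# Toy two-stage BPE: subword merges first, then superword merges.
from collections import Counter

def get_pair_counts(sequences):
    """Count adjacent token pairs across all training sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(seq, pair, new_token):
    """Replace every occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Greedily learn up to `num_merges` merges over the token sequences."""
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(sequences)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_token = pair[0] + pair[1]
        merges.append(pair)
        sequences = [apply_merge(s, pair, new_token) for s in sequences]
    return merges, sequences

corpus = "by the way the cat sat on the mat by the way".split(" ")

# Stage 1 (ordinary BPE): pretokenize on whitespace, so merges stay inside words.
words = [list(w) for w in corpus]
subword_merges, words = learn_bpe(words, num_merges=10)

# Stage 2 (SuperBPE-style): lift the whitespace constraint and keep merging,
# so tokens may now span multiple words (e.g. "by the way" as one token).
text = []
for w in words:
    text.extend(w + [" "])
superword_merges, _ = learn_bpe([text], num_merges=10)

print("subword merges:", subword_merges)
print("superword merges:", superword_merges)
```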

Three invited speakers will share their insights at TokShop! Hear from Yuval Pinter @yuvalpi, Desmond Elliott @delliott, and Adrian Łańcucki @AdrianLancuckii on cutting-edge tokenization research. Don't miss these keynote presentations! #ICML2025 tokenization-workshop.github.io/speakers
Very honored & super excited to be on this panel! If you're curious about tokenization but not sure what it’s about or what to think, come hear us attempt to demystify things! Also I’m at ICML all week, please reach out if you’d like to chat 🦙
🎤 Meet our expert panelists! Join Albert Gu, Alisa Liu, Kris Cao, Sander Land, and Yuval Pinter as they discuss the Future of Tokenization on July 18 at 3:30 PM at TokShop at #ICML2025.
Attending #ICML2025? Don't miss this TokShop panel, which will explore: 🔮 The Future of Tokenization 🔮 Featuring a stellar lineup of panelists - mark your calendar! ✨
Our tutorial on synthetic data generation is starting in 20 minutes, come check it out if you're at ACL! (or check out the slides if you're not)
We’ve prepared a tutorial for ACL this year to give you some answers. Come join @xiangyue96, @alisawuffles, @yizhongwyz, @gneubig, and me for “Synthetic Data in the Era of LLMs.” 📍 Sunday 2–3:30pm, Hall B #ACL2025
Happy to present OLMoTrace at #ACL2025NLP next week!! 🤗 If you stop by the demo session on Tuesday, July 29, 10:30am-12pm, @yanaiela and @sewon__min will be sharing how we use OLMoTrace to make LLMs more transparent. Unfortunately I'm unable to attend in-person due to visa 🥹
Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting them to their training data. We do this at unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
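A toy illustration of what OLMoTrace does conceptually: finding verbatim spans shared between a model output and the training corpus. The real system searches 4 trillion training tokens in seconds with a dedicated index; this naive sketch (all data made up) just hashes n-grams of a tiny in-memory corpus.

```python
# Naive verbatim-span matching between a model output and a toy corpus.
def ngram_index(tokens, n):
    """Map every n-gram in `tokens` to the positions where it occurs."""
    index = {}
    for i in range(len(tokens) - n + 1):
        index.setdefault(tuple(tokens[i:i + n]), []).append(i)
    return index

def matching_spans(output_tokens, corpus_tokens, n=4):
    """Return (output_pos, corpus_positions, span) for every output n-gram
    that appears verbatim in the corpus."""
    index = ngram_index(corpus_tokens, n)
    hits = []
    for i in range(len(output_tokens) - n + 1):
        gram = tuple(output_tokens[i:i + n])
        if gram in index:
            hits.append((i, index[gram], " ".join(gram)))
    return hits

corpus = "the quick brown fox jumps over the lazy dog".split()
output = "a quick brown fox jumps over my dog".split()
print(matching_spans(output, corpus, n=4))
```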
But the hotter take for me was that there's no such thing as killing them, because you need some input, and that input is your tokenized data.
Surprisingly good tokenization workshop, resurfaced thoughts:🧠📈🤖 Why isn't tokenization learned? Could we use an evolutionary algorithm, train a tokenization scheme against a pretrained model, or meta-learn it on something fast to pretrain (e.g., the loss at the beginning of training)? Let's discuss👇
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
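A small, self-contained illustration of the "byte premium" in question: under UTF-8, the same short greeting costs a very different number of bytes per character depending on the script, so byte-based tokenization charges some languages more.

```python
# Bytes per character for the same greeting in different scripts.
samples = {
    "English": "Hello, how are you?",
    "Russian": "Привет, как дела?",
    "Hindi": "नमस्ते, आप कैसे हैं?",
    "Chinese": "你好，你好吗？",
}

for lang, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:8s} chars={n_chars:2d} bytes={n_bytes:2d} "
          f"bytes/char={n_bytes / n_chars:.2f}")
```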
Talking to researchers who pretrain LLMs, we got several insightful questions on how to use SuperBPE in practice. This new blog post, written by @JonathanHayase and @alisawuffles, summarizes best practices: superbpe.github.io/faq.html
SuperBPE is accepted to COLM (w/ three 9s)!🚀 We also wrote a blog w/ new results & suggestions after working with lots of folks on SuperBPE. Highlights: tokenizer efficiency adds a new dimension to scaling laws, and SuperBPE models are up to 2x faster in long context regimes!⬇️
Jon's algorithm detaches LMs from the tokenizers they were trained with, completely at inference time! As a side effect, it means we've finally overcome tokenizer mismatches for LM ensembles like proxy-tuning!! Now you can "try on"👕 any other LM's post-training on your base LM.
Do you ever wish all LLMs used the same tokenizer?🧑🤝🧑 We present an *efficient, lossless* method to convert any LM into a byte-level model at inference time. This fixes weird tokenization artifacts at the prompt boundary and enables ensembles of LMs with mismatched tokenizers! 🧵
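A naive sketch of the core idea, not the paper's actual algorithm: given a token-level next-token distribution, a next-byte distribution can be read off by summing the probability of every token whose UTF-8 bytes extend the bytes emitted so far. The vocabulary and probabilities below are made up.

```python
# Convert a made-up next-token distribution into a next-byte distribution.
from collections import defaultdict

def next_byte_distribution(token_probs, prefix=b""):
    """token_probs: dict mapping token string -> probability of being next.
    prefix: bytes of the current, partially emitted token.
    Returns a dict mapping the next byte -> probability."""
    byte_probs = defaultdict(float)
    total = 0.0
    for token, p in token_probs.items():
        b = token.encode("utf-8")
        if b.startswith(prefix) and len(b) > len(prefix):
            byte_probs[bytes([b[len(prefix)]])] += p
            total += p
    # Tokens that end exactly at the prefix are skipped for simplicity; the
    # real method would continue into the distribution of the *next* token.
    return {byte: p / total for byte, p in byte_probs.items()}

# Hypothetical next-token distribution from some LM step.
token_probs = {" the": 0.4, " they": 0.25, " then": 0.2, " cat": 0.15}
print(next_byte_distribution(token_probs))           # first byte is b" " w.p. 1.0
print(next_byte_distribution(token_probs, b" the"))  # b"y" vs b"n" after " the"
```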
Amazing paper! Especially loved the section on low-loss tokens, which exactly fits our earlier results on why beyond-word BPE could benefit from PMI-based merging instead of frequency-based merging. aclanthology.org/2022.insights-… (see Bigrams for PMI merging and Freq for typical BPE merging by frequency)
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (avg. length: 51 minutes).
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
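A conceptual sketch, not the FlexOlmo implementation: imagine a mixture-of-experts layer where each expert was trained by one data owner, and at inference the router weights are renormalized over only the experts whose owners opted in. All shapes, weights, and names below are invented for illustration.

```python
# Toy MoE layer with opt-in/opt-out expert selection at inference time.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4

# One tiny linear "expert" per data owner (random weights for the demo).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x, opted_in):
    """x: (d_model,) activation; opted_in: indices of experts whose owners opted in."""
    logits = x @ router                  # router score for every expert
    mask = np.full(n_experts, -np.inf)
    mask[opted_in] = 0.0                 # opted-out experts get zero weight
    weights = np.exp(logits + mask)
    weights /= weights.sum()             # renormalize over opted-in experts
    return sum(weights[i] * (x @ experts[i]) for i in opted_in)

x = rng.normal(size=d_model)
print(moe_forward(x, opted_in=[0, 1, 2, 3])[:4])  # all data owners opt in
print(moe_forward(x, opted_in=[0, 2])[:4])        # owners 1 and 3 opt out
```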
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.