Sander Land
@magikarp_tokens
Breaking all the models with weird tokens
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
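For a concrete sense of the "byte premium" (a minimal sketch, not part of the thread, with illustrative greetings chosen by the editor): the same short greeting costs one UTF-8 byte per character in Latin script but two or three in most other scripts, so byte-level tokenizers start those languages at a structural disadvantage.

```python
# Minimal illustration of the UTF-8 "byte premium": the same greeting
# costs 1 byte per character in Latin script but 2-3 bytes elsewhere.
greetings = {
    "English": "hello",
    "Greek": "γεια σας",
    "Hindi": "नमस्ते",
    "Japanese": "こんにちは",
}

for language, text in greetings.items():
    encoded = text.encode("utf-8")
    print(f"{language:9s} {len(text):2d} chars -> {len(encoded):2d} UTF-8 bytes")
```

On this toy sample, "नमस्ते" is 6 code points but 18 UTF-8 bytes, while "hello" is 5 and 5: before any merges happen, the byte-level vocabulary already prices the scripts differently.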

🎤 Meet our expert panelists! Join Albert Gu, Alisa Liu, Kris Cao, Sander Land, and Yuval Pinter as they discuss the Future of Tokenization on July 18 at 3:30 PM at TokShop at #ICML2025.
Had a fantastic time at the Tokenization workshop, and I'm really grateful for the recognition of our work with a best paper award.
🏆 Announcing our Best Paper Awards! 🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO78C… 🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4xk… Congrats! 🎉
Congratulations to @linguist_cat and @magikarp_tokens on winning the best paper award at the #ICML2025 Tokenizer Workshop!
most controversial statement so far from @alisawuffles: "tokenization research is not as cool" **very vocal disagreement from the crowd of tokenization nerds**
🔥tokenization panel!
We’re excited to share that work from our @Cohere colleague @magikarp_tokens, “BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization” will be highlighted today at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to authors @magikarp_tokens and @linguist_cat.
SCRIPT-BPE coming to ICML next week!
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦 🎉 Congrats to all the researchers whose work is being presented! @simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.
Huge congrats to all the authors @dianaabagyan, @alexrs95, @fffffelipec, @kroscoo, @learnlaughcry, @acyr_l, @mziizm, @ahmetustun89. I always enjoy collabs that tackle learning efficiency as an explicit design choice rather than as post-training fixes. arxiv.org/abs/2506.10766
Thank you to co-authors @natolambert, @valentina__py, @magikarp_tokens, @jacobcares, @nlpnoah, and @HannaHajishirzi for a great collaboration! Read more in the paper here (ArXiv soon!): github.com/allenai/reward… Dataset, leaderboard, and models here: huggingface.co/collections/al…
I’m thrilled to share RewardBench 2 📊: we created a new multi-domain reward model evaluation that is substantially harder than RewardBench, trained and released 70 reward models, and gained insights about reward modeling benchmarks and downstream performance!
In other words: (graphic design by @magikarp_tokens)
Sander and I have been working on a new encoding scheme for tokenization that mitigates the variable-length byte sequences different scripts incur, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
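To make the "pretokenization alternative to regular expressions" idea concrete, here is a toy, heavily simplified sketch; it is not the SCRIPT-BPE algorithm from the paper, and the helpers `coarse_script` and `pretokenize` are invented for illustration. It splits text wherever a coarse Unicode script label changes, using only the standard library.

```python
import unicodedata


def coarse_script(ch: str) -> str:
    """Very rough script label: the first word of the Unicode character name
    (e.g. 'LATIN', 'HIRAGANA', 'DEVANAGARI'); whitespace and digits get their
    own buckets so they always form separate pieces."""
    if ch.isspace():
        return "SPACE"
    if ch.isdigit():
        return "DIGIT"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no Unicode name
        return "OTHER"


def pretokenize(text: str) -> list[str]:
    """Split text wherever the coarse script label changes, instead of
    relying on a hand-written regular expression."""
    pieces: list[str] = []
    current, current_script = "", None
    for ch in text:
        script = coarse_script(ch)
        if script != current_script and current:
            pieces.append(current)
            current = ""
        current += ch
        current_script = script
    if current:
        pieces.append(current)
    return pieces


print(pretokenize("GPT-4はすごい model! 123"))
# ['GPT', '-', '4', 'はすごい', ' ', 'model', '!', ' ', '123']
```

Real pretokenizers handle punctuation, casing, and whitespace conventions far more carefully; the point of the sketch is only that script boundaries give a language-agnostic split criterion without a hand-tuned regex.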