Sander Land
@magikarp_tokens
Breaking all the models with weird tokens
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
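For a concrete sense of the "byte premium" (a minimal sketch, not part of the thread, with illustrative greetings chosen by the editor): the same short greeting costs one UTF-8 byte per character in Latin script but two or three in most other scripts, so byte-level tokenizers start those languages at a structural disadvantage.

```python
# Minimal illustration of the UTF-8 "byte premium": the same greeting
# costs 1 byte per character in Latin script but 2-3 bytes elsewhere.
greetings = {
    "English": "hello",
    "Greek": "γεια σας",
    "Hindi": "नमस्ते",
    "Japanese": "こんにちは",
}

for language, text in greetings.items():
    encoded = text.encode("utf-8")
    print(f"{language:9s} {len(text):2d} chars -> {len(encoded):2d} UTF-8 bytes")
```

On this toy sample, "नमस्ते" is 6 code points but 18 UTF-8 bytes, while "hello" is 5 and 5: before any merges happen, the byte-level vocabulary already prices the scripts differently.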

🎤 Meet our expert panelists! Join Albert Gu, Alisa Liu, Kris Cao, Sander Land, and Yuval Pinter as they discuss the Future of Tokenization on July 18 at 3:30 PM at TokShop at #ICML2025.
Had a fantastic time at the Tokenization workshop, and I'm really grateful for the recognition of our work with a best paper award.
🏆 Announcing our Best Paper Awards! 🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO78C… 🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4xk… Congrats! 🎉
Congratulations to @linguist_cat and @magikarp_tokens on winning the best paper award at the #ICML2025 Tokenizer Workshop!
most controversial statement so far from @alisawuffles: "tokenization research is not as cool" **very vocal disagreement from the crowd of tokenization nerds**
🔥tokenization panel!
We’re excited to share that work from our @Cohere colleague @magikarp_tokens, “BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization” will be highlighted today at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to authors @magikarp_tokens and @linguist_cat.
SCRIPT-BPE coming to ICML next week!
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦 🎉 Congrats to all the researchers whose work is being presented! @simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.
Huge congrats to all the authors @dianaabagyan, @alexrs95, @fffffelipec, @kroscoo, @learnlaughcry, @acyr_l, @mziizm, @ahmetustun89. I always enjoy collabs that tackle learning efficiency as an explicit design choice rather than as post-training fixes. arxiv.org/abs/2506.10766
Thank you to co-authors @natolambert, @valentina__py, @magikarp_tokens, @jacobcares, @nlpnoah, and @HannaHajishirzi for a great collaboration! Read more in the paper here (ArXiv soon!): github.com/allenai/reward… Dataset, leaderboard, and models here: huggingface.co/collections/al…
I’m thrilled to share RewardBench 2 📊: we created a new multi-domain reward model evaluation that is substantially harder than RewardBench, trained and released 70 reward models, and gained insights about reward modeling benchmarks and downstream performance!
In other words: (graphic design by @magikarp_tokens)
Sander and I have been working on a new encoding scheme for tokenization that mitigates the variable-length byte sequences different scripts incur, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
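To make the "pretokenization alternative to regular expressions" idea concrete, here is a toy, heavily simplified sketch; it is not the SCRIPT-BPE algorithm from the paper, and the helpers `coarse_script` and `pretokenize` are invented for illustration. It splits text wherever a coarse Unicode script label changes, using only the standard library.

```python
import unicodedata


def coarse_script(ch: str) -> str:
    """Very rough script label: the first word of the Unicode character name
    (e.g. 'LATIN', 'HIRAGANA', 'DEVANAGARI'); whitespace and digits get their
    own buckets so they always form separate pieces."""
    if ch.isspace():
        return "SPACE"
    if ch.isdigit():
        return "DIGIT"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no Unicode name
        return "OTHER"


def pretokenize(text: str) -> list[str]:
    """Split text wherever the coarse script label changes, instead of
    relying on a hand-written regular expression."""
    pieces: list[str] = []
    current, current_script = "", None
    for ch in text:
        script = coarse_script(ch)
        if script != current_script and current:
            pieces.append(current)
            current = ""
        current += ch
        current_script = script
    if current:
        pieces.append(current)
    return pieces


print(pretokenize("GPT-4はすごい model! 123"))
# ['GPT', '-', '4', 'はすごい', ' ', 'model', '!', ' ', '123']
```

Real pretokenizers handle punctuation, casing, and whitespace conventions far more carefully; the point of the sketch is only that script boundaries give a language-agnostic split criterion without a hand-tuned regex.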