Catherine Arnett @ ACL 🇦🇹
@linguist_cat
NLP Researcher @AiEleuther. PhD @UCSanDiego Linguistics. Previously @pleiasfr @EdinburghUni. Interested in multilingual NLP, tokenizers, open science. She/her.
✨New pre-print✨ Crosslingual transfer allows models to leverage representations learned for one language to improve performance on another. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.

If you want to help us improve language and cultural coverage, and build an open-source LangID system, please register for our shared task! 💬 Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/ Deadline: July 23, 2025 (AoE) ⏰
commoncrawl.org/blog/wmdqs-sha…
Really grateful to the organizers for the recognition of @magikarp_tokens and my work!
🏆 Announcing our Best Paper Awards! 🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO78C… 🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4xk… Congrats! 🎉
We’re excited to share that work from our @Cohere colleague @magikarp_tokens, “BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization” will be highlighted today at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to authors @magikarp_tokens and @linguist_cat.
SCRIPT-BPE coming to ICML next week!
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to all researchers with work presented! @simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
Why do language models start by converting text to bytes? 🤔 UTF-8 solved a 1992 storage problem. LLMs have different needs. 🧵New post explaining how we can do better: Beyond Bytes ⮕ Fun fact: GPT-4o tokenizes that arrow as [b' \xe2', b'\xae', b'\x95\n\n'] 🤖💥
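A quick sketch of the fun fact above: the arrow "⮕" (U+2B95) takes three bytes in UTF-8, so a byte-level tokenizer is free to place token boundaries *inside* a single character. This minimal Python example (standard library only, not the GPT-4o tokenizer itself) just shows the raw bytes that make such splits possible:

```python
# "⮕" is one character but three UTF-8 bytes: 0xE2, 0xAE, 0x95.
# A byte-level BPE operates on these bytes, so merges can stop
# mid-character, producing tokens like b'\xe2' or b'\x95\n\n'.
arrow = "\u2b95"  # ⮕
raw = arrow.encode("utf-8")

print(len(arrow))              # 1 character
print(len(raw))                # 3 bytes
print([bytes([b]) for b in raw])  # [b'\xe2', b'\xae', b'\x95']
```

The exact merges depend on the model's learned vocabulary; the point is simply that byte-level encoding exposes sub-character units for the tokenizer to work with.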