Catherine Arnett @ ACL 🇦🇹
@linguist_cat
NLP Researcher @AiEleuther. PhD @UCSanDiego Linguistics. Previously @pleiasfr @EdinburghUni. Interested in multilingual NLP, tokenizers, open science. She/her.
✨New pre-print✨ Crosslingual transfer allows models to leverage representations learned for one language to improve performance on another. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.

If you want to help us improve language and cultural coverage, and build an open-source LangID system, please register for our shared task! 💬 Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/ Deadline: July 23, 2025 (AoE) ⏰
commoncrawl.org/blog/wmdqs-sha…
Really grateful to the organizers for the recognition of @magikarp_tokens and my work!
🏆 Announcing our Best Paper Awards! 🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO78C… 🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4xk… Congrats! 🎉
We’re excited to share that work from our @Cohere colleague @magikarp_tokens, “BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization” will be highlighted today at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to authors @magikarp_tokens and @linguist_cat.
SCRIPT-BPE coming to ICML next week!
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦 🎉Congrats to all researchers with work presented! @simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
Why do language models start by converting text to bytes? 🤔 UTF-8 solved a 1992 storage problem. LLMs have different needs. 🧵New post explaining how we can do better: Beyond Bytes ⮕ Fun fact: GPT-4o tokenizes that arrow as [b' \xe2', b'\xae', b'\x95\n\n'] 🤖💥
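A quick sketch of the fun fact above: the arrow "⮕" (U+2B95) takes three bytes in UTF-8, so a byte-level tokenizer is free to place token boundaries *inside* a single character. This minimal Python example (standard library only, not the GPT-4o tokenizer itself) just shows the raw bytes that make such splits possible:

```python
# "⮕" is one character but three UTF-8 bytes: 0xE2, 0xAE, 0x95.
# A byte-level BPE operates on these bytes, so merges can stop
# mid-character, producing tokens like b'\xe2' or b'\x95\n\n'.
arrow = "\u2b95"  # ⮕
raw = arrow.encode("utf-8")

print(len(arrow))              # 1 character
print(len(raw))                # 3 bytes
print([bytes([b]) for b in raw])  # [b'\xe2', b'\xae', b'\x95']
```

The exact merges depend on the model's learned vocabulary; the point is simply that byte-level encoding exposes sub-character units for the tokenizer to work with.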