Ruochen Zhang
@ruochenz_
Interning @cohere, PhDing @Brown_NLP & @health_nlp, working on multilingual NLP and interpretability. Prev: Undergrad @sutdsg, she/they
🤔Ever wonder why LLMs give inconsistent answers in different languages? In our paper, we identify two failure points in the multilingual factual recall process and propose fixes that guide LLMs to the "right path." This can boost performance by 35% in the weakest language! 📈

Sad to miss ACL in Vienna, but so many members of SEACrowd will be there to present this work🔥 Reach out or spot us in our merch 😉 and learn about our ongoing initiatives and how to participate 😎
SEA-VL: Building AI That Understands Southeast Asia 🇧🇳🇰🇭🇹🇱🇮🇩🇱🇦🇲🇾🇲🇲🇵🇭🇸🇬🇹🇭🇻🇳 We just released SEA-VL, the largest vision-language dataset tailored for SEA’s diverse culture. 📜 arXiv: arxiv.org/abs/2503.07920 🤗 Data: huggingface.co/collections/SE… Check the thread 🧵
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt your data in or out…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
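To make the opt-in/out idea concrete, here is a minimal sketch of expert masking in a mixture-of-experts layer. This is purely illustrative: the class, shapes, and routing below are my assumptions, not the FlexOlmo implementation.

```python
import torch
import torch.nn as nn

class OptInMoE(nn.Module):
    """Toy MoE layer where each expert corresponds to one data owner's
    corpus, and owners can opt their expert out at inference time."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, opted_in: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)
        # Mask opted-out experts before softmax so they get zero weight.
        logits = logits.masked_fill(~opted_in, float("-inf"))
        weights = logits.softmax(dim=-1)                        # (batch, E)
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, E)
        return (out * weights.unsqueeze(-2)).sum(dim=-1)         # (batch, dim)

moe = OptInMoE(dim=16, num_experts=4)
x = torch.randn(2, 16)
opted_in = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = moe(x, opted_in)
```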
Check out our new paper: “How Do Vision-Language Models Process Conflicting Information Across Modalities?”! Vision-language models often struggle with conflicting inputs - we show how their internal representations and key attention heads reveal when and how this happens, and…
Can coding agents autonomously implement AI research extensions? We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code. Finding: Most agents we tested had a low success rate, but there is promise!
Excited to announce the call for papers for the Multilingual Representation Learning workshop #EMNLP2025 sigtyp.github.io/ws2025-mrl.html with @_dataman_ @linguist_cat Jiayi Wang @fdschmidt @tylerachang @hila_gonen and amazing speakers: Alice Oh, Kelly Marchisio, & Pontus Stenetorp
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
📢 #acl2025 - main: 🤔Continued pretraining of LLMs in new languages often includes English data, but why? 💡We found that including English doesn't improve validation perplexity in the target language, yet it is critical for the emergence of abilities such as in-context learning! (1/5)
Sander and I have been working on a new encoding scheme for tokenization which mitigates the variable-length byte sequences that different scripts incur, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
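To see the byte premium concretely, here is a quick Python check (my own illustration, not from the thread) of how many UTF-8 bytes the same short greeting costs in different scripts:

```python
# The same greeting costs a different number of bytes per script,
# so byte-level tokenizers charge some languages a premium
# before any model ever sees the text.
texts = {
    "English": "hello",     # ASCII: 1 byte per character
    "Russian": "привет",    # Cyrillic: 2 bytes per character
    "Thai": "สวัสดี",        # Thai: 3 bytes per character
    "Hindi": "नमस्ते",       # Devanagari: 3 bytes per character
}

for lang, text in texts.items():
    encoded = text.encode("utf-8")
    print(f"{lang}: {len(text)} chars -> {len(encoded)} bytes "
          f"({len(encoded) / len(text):.1f} bytes/char)")
```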
New KDD 2025 paper: Can large language models (LLMs) reason like biomedical scientists? We introduce K-Paths, a retrieval framework for extracting reasoning paths from knowledge graphs (KGs) to aid drug discovery tasks. 👇 Thread:
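As a toy illustration of what "extracting reasoning paths from a KG" can look like (entities, relations, and the path enumeration below are invented for this sketch, not the actual K-Paths pipeline), here is a minimal version with networkx:

```python
import networkx as nx

# Tiny hypothetical biomedical knowledge graph.
kg = nx.DiGraph()
kg.add_edge("metformin", "AMPK", relation="activates")
kg.add_edge("AMPK", "mTOR", relation="inhibits")
kg.add_edge("mTOR", "tumor growth", relation="promotes")

def reasoning_paths(graph, source, target, cutoff=4):
    """Enumerate simple paths from source to target and render each
    one as a human-readable relation chain for an LLM prompt."""
    for path in nx.all_simple_paths(graph, source, target, cutoff=cutoff):
        steps = [
            f"{u} --{graph[u][v]['relation']}--> {v}"
            for u, v in zip(path, path[1:])
        ]
        yield " ; ".join(steps)

for p in reasoning_paths(kg, "metformin", "tumor growth"):
    print(p)
# metformin --activates--> AMPK ; AMPK --inhibits--> mTOR ; mTOR --promotes--> tumor growth
```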