Ruochen Zhang
@ruochenz_
Interning @cohere, PhDing @Brown_NLP & @health_nlp, working on multilingual NLP and interpretability. Prev: Undergrad @sutdsg, she/they
🤔Ever wonder why LLMs give inconsistent answers in different languages? In our paper, we identify two failure points in the multilingual factual recall process and propose fixes that guide LLMs to the "right path." This can boost performance by 35% in the weakest language! 📈

Sad to miss ACL in Vienna, but so many members of SEACrowd will be there to present this work🔥 Reach out or spot us in our merch 😉 and learn about our ongoing initiatives and how to participate 😎
SEA-VL: Building AI That Understands Southeast Asia 🇧🇳🇰🇭🇹🇱🇮🇩🇱🇦🇲🇾🇲🇲🇵🇭🇸🇬🇹🇭🇻🇳 We just released SEA-VL, the largest vision-language dataset tailored for SEA’s diverse culture. 📜 arXiv: arxiv.org/abs/2503.07920 🤗 Data: huggingface.co/collections/SE… Check the thread 🧵
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt your data in or out…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
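To make the opt-in/out idea concrete, here is a minimal sketch of expert masking in a mixture-of-experts layer. This is purely illustrative: the class, shapes, and routing below are my assumptions, not the FlexOlmo implementation.

```python
import torch
import torch.nn as nn

class OptInMoE(nn.Module):
    """Toy MoE layer where each expert corresponds to one data owner's
    corpus, and owners can opt their expert out at inference time."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, opted_in: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)
        # Mask opted-out experts before softmax so they get zero weight.
        logits = logits.masked_fill(~opted_in, float("-inf"))
        weights = logits.softmax(dim=-1)                        # (batch, E)
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, E)
        return (out * weights.unsqueeze(-2)).sum(dim=-1)         # (batch, dim)

moe = OptInMoE(dim=16, num_experts=4)
x = torch.randn(2, 16)
opted_in = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = moe(x, opted_in)
```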
Check out our new paper: “How Do Vision-Language Models Process Conflicting Information Across Modalities?”! Vision-language models often struggle with conflicting inputs - we show how their internal representations and key attention heads reveal when and how this happens, and…
Can coding agents autonomously implement AI research extensions? We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code. Finding: Most agents we tested had a low success rate, but there is promise!
Excited to announce the call for papers for the Multilingual Representation Learning workshop #EMNLP2025 sigtyp.github.io/ws2025-mrl.html with @_dataman_ @linguist_cat Jiayi Wang @fdschmidt @tylerachang @hila_gonen and amazing speakers: Alice Oh, Kelly Marchisio, & Pontus Stenetorp
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by @linguist_cat on tokenizers, their limitations, and how to improve them.
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
📢 #acl2025 - main: 🤔Continued pretraining of LLMs in new languages often includes English data, but why? 💡We found that including English doesn't improve validation perplexity in the target language, yet it is critical for the emergence of abilities such as in-context learning! (1/5)
Sander and I have been working on a new encoding scheme for tokenization which mitigates the variable-length byte sequences that different scripts incur, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
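To see the byte premium concretely, here is a quick Python check (my own illustration, not from the thread) of how many UTF-8 bytes the same short greeting costs in different scripts:

```python
# The same greeting costs a different number of bytes per script,
# so byte-level tokenizers charge some languages a premium
# before any model ever sees the text.
texts = {
    "English": "hello",     # ASCII: 1 byte per character
    "Russian": "привет",    # Cyrillic: 2 bytes per character
    "Thai": "สวัสดี",        # Thai: 3 bytes per character
    "Hindi": "नमस्ते",       # Devanagari: 3 bytes per character
}

for lang, text in texts.items():
    encoded = text.encode("utf-8")
    print(f"{lang}: {len(text)} chars -> {len(encoded)} bytes "
          f"({len(encoded) / len(text):.1f} bytes/char)")
```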
New KDD 2025 paper: Can large language models (LLMs) reason like biomedical scientists? We introduce K-Paths, a retrieval framework for extracting reasoning paths from knowledge graphs (KGs) to aid drug discovery tasks. 👇 Thread:
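As a toy illustration of what "extracting reasoning paths from a KG" can look like (entities, relations, and the path enumeration below are invented for this sketch, not the actual K-Paths pipeline), here is a minimal version with networkx:

```python
import networkx as nx

# Tiny hypothetical biomedical knowledge graph.
kg = nx.DiGraph()
kg.add_edge("metformin", "AMPK", relation="activates")
kg.add_edge("AMPK", "mTOR", relation="inhibits")
kg.add_edge("mTOR", "tumor growth", relation="promotes")

def reasoning_paths(graph, source, target, cutoff=4):
    """Enumerate simple paths from source to target and render each
    one as a human-readable relation chain for an LLM prompt."""
    for path in nx.all_simple_paths(graph, source, target, cutoff=cutoff):
        steps = [
            f"{u} --{graph[u][v]['relation']}--> {v}"
            for u, v in zip(path, path[1:])
        ]
        yield " ; ".join(steps)

for p in reasoning_paths(kg, "metformin", "tumor growth"):
    print(p)
# metformin --activates--> AMPK ; AMPK --inhibits--> mTOR ; mTOR --promotes--> tumor growth
```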