William Chen
@chenwanch1
PhD Student @LTIatCMU @SCSatCMU | Masters @LTIatCMU | Formerly @TXInstruments | @UCF ‘21
What happens if you scale Whisper to billions of parameters? Our #ICML2025 paper develops scaling laws for ASR/ST models, training models with up to 18B params, 360K hours of data, and 100+ languages. Joint work b/w @LTIatCMU and @nvidia. arxiv.org/abs/2502.10373
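(For context: a scaling law fits a simple parametric curve, usually a power law, to how eval loss falls as model size or data grows. Below is a minimal sketch with made-up numbers, not the paper's actual functional form or results.)

```python
# Minimal sketch of a scaling-law fit: eval loss vs. model size with a power law
# L(N) = a * N^(-b) + c. All numbers below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    return a * n_params ** (-b) + c

sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])        # model sizes in params (hypothetical)
losses = np.array([1.09, 1.01, 0.93, 0.86, 0.80])   # eval losses (hypothetical)

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[5.0, 0.1, 0.3], maxfev=10000)
print(f"fit: L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")

# Extrapolate the fitted curve to a larger model, e.g. 18B params.
print("predicted loss at 18B params:", power_law(1.8e10, a, b, c))
```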

Multilingual representation alignment through images (and without parallel data for new languages): check out Nate's work at #ACL2025NLP tomorrow. Paper: aclanthology.org/2025.acl-short… More details: 👇
Can multilingual text encoders borrow the semantic space from images to align their representations cross-lingually? I am presenting my paper “Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning” at the 4pm poster session at ACL tomorrow! 🧵 (1/9)
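(Rough idea of the mechanism, as a hedged sketch: a CLIP-style contrastive objective pulls each caption toward its paired image embedding, so captions in different languages that share an image land near each other. This is the general technique, not the paper's exact architecture or hyperparameters.)

```python
# Sketch of a symmetric image-caption contrastive (InfoNCE) loss, CLIP-style.
import torch
import torch.nn.functional as F

def image_caption_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric InfoNCE: image i should match caption i, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(image_caption_contrastive_loss(img, txt))
```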
Meows, music, murmurs and more! We train a general-purpose audio encoder and open-source the code, checkpoints, and evaluation toolkit.
Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe, "OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder," arxiv.org/abs/2507.14129
The week ahead: 2025-07-21 to 2025-07-27 [Tweet] OpenAI’s new model achieves gold medal-level performance at the IMO. [Blog] Calvin’s thoughts as he departs OpenAI. [Tweet] Thinking Machines Lab, the AI startup led by Mira Murati, OpenAI’s ex-CTO... Read more: llz.info/weekly
One of my favorite moments at #ICML2025 was being able to witness @_albertgu and the @cartesia_ai team’s reaction to Mamba being on the coffee sign. Felt surreal seeing someone realize their cultural impact.

I’ll be presenting this Thursday at 4:30pm in the West Hall, poster 418. Drop by to learn more about our latest experience in burning compute!
What happens if you scale Whisper to billions of parameters? Our #ICML2025 paper develops scaling laws for ASR/ST models, training models with up to 18B params, 360K hours of data, and 100+ languages. Joint work b/w @LTIatCMU and @nvidia. arxiv.org/abs/2502.10373
Presenting our #ICML2025 poster today! Discover our continuous, end-to-end approach that helps speech language models process speech prosody. Come learn more! 📍 W-411 (West Exhibition Hall B2 - B3) ⏰ 4:30 ~ 7:00 PM icml.cc/virtual/2025/p…
Thrilled to share our #ICML2025 paper! We introduce a variational approach for speech language models, automating speech attribute learning to deliver more natural, human-like speech. Joint work b/w @LTIatCMU and @Apple Read it: arxiv.org/abs/2506.14767
Not advertised yet, but we figured out how to do this too. And we release exactly how you can do it 👀. With the right training techniques, you can inject audio understanding and generation into an LLM with almost no loss in text perf. Details at arxiv.org/abs/2506.17611
the best part about the mistral release is that the models don't lose as much on text - this has been the biggest pain point for audio LMs for a long while
how do yall think current day google translate works?? everyone's just stupid now i guess
twitter changed the embedded translation feature to "translate with grok" so now out of sheer spite i am going to learn every single language ever. fuck ai
What is it with speech reviewers on openreview? In my past 3 submissions (EMNLP 24, ICML 25, EMNLP 25), I have gotten only 1 reply to a rebuttal, out of a total of 11 reviews. Very frustrating, esp since they ask for more results and analyses that take a lot of time/compute.
🔊 New release: #ARECHO -> Autoregressive Evaluation via Chain-based Hypothesis Optimization. • 87-metric coverage in one model 🧮 • Dynamic classifier chain 🤝 • Unified tokenization 🧩 • Confidence-aware decoding 🛡️ Built on #UniVERSA, heading to #VERSA. More ↓
🚀 Happy to share our #INTERSPEECH2025 paper: Using speaker & acoustic context, we dynamically adjust model paths, resulting in a 25.7% relative BLEU improvement in speech translation. We also analyze how context influences model behavior. 📜 Paper: arxiv.org/abs/2505.18860
🚀 Introducing Uni-VERSA: a unified model for multi-dimensional speech evaluation: naturalness, intelligibility, noise, prosody & more. ⚡ 109× faster than native VERSA metric computation 🤗 Pretrained models + Colab demo 🧰 VERSA integration coming! 🔗 huggingface.co/collections/es…
Uni-VERSA: Versatile Speech Assessment with a Unified Network. arxiv.org/abs/2505.20741
I’ll be interning at Adobe Research in San Francisco this summer, working on audio generation. HMU if you’re in the area and want to chat about speech / audio AI!

7/7 papers accepted to #Interspeech2025 🎉 Lots of interesting work from my fantastic co-authors on long-form processing, multilingualism, and multi-modal foundation models. See y’all in Rotterdam 🇳🇱
Excited to share our survey paper accepted to #ACL2025NLP Findings: When Large Language Models Meet Speech: A Survey on Integration Approaches by Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu (@knccch) 1/5
Do you really need audio to fine-tune your Audio LLM? 🤔 Answer below: Introducing Omni-R1, a simple GRPO fine‑tuning method for Qwen2.5‑Omni on audio question answering. It sets new state‑of‑the‑art accuracies on the MMAU benchmark for Audio LLMs. arxiv.org/abs/2505.09439
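(For readers unfamiliar with GRPO: the core idea is a group-relative advantage. Sample several answers per question, score each with a reward, and normalize the rewards within the group. A hedged sketch of that step only, not the paper's implementation; `group_relative_advantages` is a hypothetical helper name.)

```python
# Sketch of GRPO's group-relative advantage: answers better than their group's
# average get positive advantage and are reinforced. Rewards here could be, e.g.,
# 0/1 correctness on audio question answering.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_questions, group_size) reward for each sampled answer."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 questions, 4 sampled answers each, 0/1 correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```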