Alexander H. Liu
@alex_h_liu
Ph.D. Student @MIT_CSAIL
The Voxtral tech report is up! arxiv.org/abs/2507.13264 We release these models under a permissive Apache 2.0 license. Feedback is welcome! We have a lot more cooking, this is just the beginning.
💡 Bridging speech, sound, & music representations with one universal model? We introduce USAD ✅
📚 Distills knowledge from domain-specific SSL models
🎯 Matches expert models across speech/audio/music tasks
📄 arxiv.org/abs/2506.18843
🧑‍💻 huggingface.co/MIT-SLS/USAD-B…
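As a rough illustration of the kind of multi-teacher distillation described above, here is a minimal sketch: a single student is trained to predict the features of several frozen domain-specific SSL teachers. The teacher dimensions, linear prediction heads, and L1 loss below are assumptions for illustration, not USAD's actual recipe.

```python
# Minimal multi-teacher feature-distillation sketch (illustrative only, not USAD's code).
# One "universal" student predicts hidden features of several frozen domain-specific
# SSL teachers (e.g. speech / sound / music models).
import torch
import torch.nn as nn

class DistillStudent(nn.Module):
    def __init__(self, dim=768, teacher_dims=(768, 768, 512)):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        # One linear prediction head per teacher (head design is an assumption).
        self.heads = nn.ModuleList(nn.Linear(dim, d) for d in teacher_dims)

    def forward(self, feats):                     # feats: (B, T, dim) input features
        h = self.encoder(feats)
        return [head(h) for head in self.heads]   # one prediction per teacher

def distill_loss(preds, teacher_feats):
    # L1 distance to each frozen teacher's features, averaged over teachers
    # (the loss choice here is an assumption, not taken from the paper).
    return sum(nn.functional.l1_loss(p, t) for p, t in zip(preds, teacher_feats)) / len(preds)
```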
Highly recommended!!! (Happy to chat if you’re curious about the experience with the team)
Our team at NVIDIA is continuously looking for highly motivated interns to work on machine intelligence for audio understanding and synthesis. Please reach out if you would like to collaborate with us!
Turns out speech self-supervised learning techniques can be generalized to sign language! Great work led by @Shester_G (he's looking for a PhD opportunity this year!)
Ever imagined a foundation model for sign language?! Introducing SHuBERT (Sign Hidden Unit BERT)! With SHuBERT, we get SOTA results on ASL video understanding tasks compared to task-specific models from Google DeepMind, Meta, and Microsoft, while using less compute! 🧵 1/9
💚 Big shoutout to the #FUGATTO team for making this release happen — and to cats like Coltrane and Xenakis, who envisioned a world where "saxophones bark and howl." Together, artists and researchers, let’s build a GPT-like future for audio generation! fugatto.github.io
Q: Why can't we get GPT-level understanding from language models on speech? A: We need better speech tokens! In SyllableLM, *we beat @kyutai_labs Moshi on semantic understanding in 70 hours of training* by making speech tokens at 5 frames/s With @PuyuanPeng, David Harwath 1/n
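For a rough picture of what low-frame-rate speech tokens involve, here is a minimal sketch that mean-pools frame-level SSL features into coarse chunks and quantizes them with k-means. SyllableLM itself learns syllable-like boundaries, so the fixed-size pooling, feature shapes, and codebook size here are simplifying assumptions, not the paper's method.

```python
# Simplified sketch of low-frame-rate speech tokenization (not SyllableLM's actual method).
# Idea: collapse ~50 Hz SSL features into ~5 units/s, then map each pooled vector
# to a discrete token with k-means.
import numpy as np
from sklearn.cluster import KMeans

def pool_features(feats, in_rate=50, out_rate=5):
    """feats: (T, D) frame-level SSL features at `in_rate` Hz.
    Returns mean-pooled features at roughly `out_rate` units per second."""
    chunk = in_rate // out_rate                       # e.g. 10 frames per pooled unit
    T = (len(feats) // chunk) * chunk
    return feats[:T].reshape(-1, chunk, feats.shape[1]).mean(axis=1)

# Fit a codebook on pooled features from a training set, then tokenize new utterances.
train_feats = np.random.randn(10_000, 768).astype(np.float32)   # placeholder features
codebook = KMeans(n_clusters=500, n_init=10).fit(pool_features(train_feats))

utt = np.random.randn(500, 768).astype(np.float32)               # ~10 s of speech at 50 Hz
tokens = codebook.predict(pool_features(utt))                    # ~50 tokens -> ~5 tokens/s
```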
Synthetic labels are amazing! Do you need an audio labelling machine? Audio Flamingo checkpoints are available on github.com/NVIDIA/audio-f… ...and pre-training with synthetic labels from Audio Flamingo gives large improvements in text-to-audio models arxiv.org/abs/2406.15487
Looking forward to meeting friends at #ICASSP2024

Beautiful work by Alex Liu on generative pre-training for speech with Flow Matching. I just realized it's one of the main components in AudioBox! arxiv.org/abs/2310.16338
Recent years have witnessed significant developments in audio codec models (an overview figure from arxiv.org/abs/2402.13236). We introduce Codec-SUPERB (arxiv.org/abs/2402.13071) to enable fair and comprehensive comparison. Leaderboard: codecsuperb.com
Lin-Shan: if no one asked you to attend the closing ceremony, you're probably not getting the award (and he laughed out loud)
Prof. Lin-Shan Lee remembers all his students… amazing…
LTU and LTU-AS code is released. As usual, it is a full release including training and inference code, pretrained checkpoints, and the datasets. We hope these will be useful. Check github.com/YuanGongND/ltu.
I'll be giving a keynote talk at ASRU'23! asru2023.org/motion.asp?sit… See you soon in Taiwan! Actually, ASRU was the first conference to reject my first-author paper (back in 2003). But 20 years later, I've been given the opportunity to be a keynote speaker, haha.
We summarize our lab's activities toward speech foundation models at wavlab.org/activities/202…. These are selected papers presented at ASRU; we have several other ongoing activities as well.
🚀 Our upgraded audio large language model LTU-2 is now hosted on HuggingFace Space at lnkd.in/eJDpsBY4. Please give it a try and let us know what you think 😀 .
🗣️ Whisper is great for speech recognition, but it only recognizes ~100 languages. What if it wasn't trained on the language that you speak? Happy to introduce my #INTERSPEECH2023 paper comparing Whisper and XLS-R for adaptation to unseen languages! arxiv.org/abs/2305.12606
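As a rough sketch of what adapting Whisper to a new language can look like, here is a generic fine-tuning outline with Hugging Face Transformers. This is not the paper's exact setup: the checkpoint name, placeholder audio, and single-example loop are assumptions for illustration.

```python
# Hedged sketch: continuing Whisper training on a language outside its original ~100
# (a generic fine-tuning outline, not the paper's experimental setup).
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One (audio, transcript) pair; in practice, iterate over a dataset in the new language.
audio = torch.randn(16000 * 5).numpy()            # placeholder 5 s waveform at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("example transcript", return_tensors="pt").input_ids

model.train()
loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()                                   # plug into an optimizer / Trainer loop
```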