Andrew Rouditchenko 🇺🇦
@arouditchenko
PhD student at MIT working on multi-modal and multilingual speech. I was an intern at @AIatMeta and @Apple MLR.
Do you really need audio to fine-tune your Audio LLM? 🤔 Answer below: Introducing Omni-R1, a simple GRPO fine‑tuning method for Qwen2.5‑Omni on audio question answering. It sets new state‑of‑the‑art accuracies on the MMAU benchmark for Audio LLMs. arxiv.org/abs/2505.09439
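For context, a minimal sketch of the GRPO idea as it applies to multiple-choice audio QA (e.g. MMAU-style questions): sample several completions per question, score them with a rule-based reward, and normalize rewards within the group to get advantages. Function names below are illustrative placeholders, not the Omni-R1 code.

```python
# Illustrative sketch of GRPO-style rewards/advantages for multiple-choice audio QA.
# Not the Omni-R1 implementation; names are made up for illustration.
import re
import torch

def answer_reward(completion: str, gold_choice: str) -> float:
    """Rule-based reward: 1.0 if the model's stated answer letter matches the gold choice."""
    match = re.search(r"answer\s*[:is]*\s*([A-D])", completion, flags=re.IGNORECASE)
    return 1.0 if match and match.group(1).upper() == gold_choice.upper() else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO normalizes rewards within the group of completions sampled for one prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps). No learned value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Example: 4 sampled completions for one audio question whose gold answer is "B".
completions = ["The answer is B", "Answer: A", "I think the answer is B", "Answer: D"]
rewards = torch.tensor([answer_reward(c, "B") for c in completions])
print(grpo_advantages(rewards))  # positive for correct samples, negative otherwise
```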
mWhisper-Flamingo was accepted to IEEE Signal Processing Letters! To celebrate, I uploaded my presentation about it: youtu.be/NjeEZWO7m9I I would have submitted to Interspeech, but I couldn't travel during those dates. I'm hoping to present this at ICASSP 2026 in Spain!
If your PhD advisor dressed like this, you probably didn't use neural nets in your thesis
Finally, after all these years of being mocked, ffmpeg enthusiasts win!
💡Bridging speech, sound, & music representations with one universal model? We introduce USAD ✅ 📚 Distills knowledge from domain-specific SSL models 🎯 Matches expert models across speech/audio/music tasks 📄 arxiv.org/abs/2506.18843 🧑‍💻 huggingface.co/MIT-SLS/USAD-B…
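A rough sketch of the general multi-teacher feature-distillation idea (illustrative only, not the USAD implementation): a single student encoder is trained so that per-teacher heads on its features regress the features of frozen speech/audio/music teachers.

```python
# Minimal sketch of multi-teacher feature distillation; names and sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillStudent(nn.Module):
    def __init__(self, dim=768, teacher_dims=(768, 768, 768)):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One prediction head per teacher (e.g. speech, general audio, music).
        self.heads = nn.ModuleList(nn.Linear(dim, d) for d in teacher_dims)

    def forward(self, x):
        h = self.encoder(x)
        return [head(h) for head in self.heads]

def distill_loss(preds, teacher_feats):
    """Sum of per-teacher losses; L1 plus cosine distance is a common choice."""
    total = 0.0
    for p, t in zip(preds, teacher_feats):
        total = total + F.l1_loss(p, t) + (1 - F.cosine_similarity(p, t, dim=-1).mean())
    return total

# Toy usage: batch of 2 clips, 100 frames, 768-dim input features.
student = DistillStudent()
x = torch.randn(2, 100, 768)
teacher_feats = [torch.randn(2, 100, 768) for _ in range(3)]  # frozen teacher outputs
loss = distill_loss(student(x), teacher_feats)
loss.backward()
```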
Learn to figure out what is worth figuring out: kamperh.com/2025/06/20/kno…
Congrats to Edson for leading our Contrastive Audio-Visual Masked Autoencoders 2.0 Project (CAV-MAE Sync), accepted at #CVPR2025! Check out Edson's thread for more details ⬇️
🚀 Excited to announce our #CVPR2025 paper: CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment! We introduce a simple yet effective method for improved audio-visual learning. 🔗 Project: edsonroteia.github.io/cav-mae-sync/ 🧵 (1/7)👇
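For readers unfamiliar with the objective family: a generic sketch of a symmetric audio-visual contrastive (InfoNCE) loss, the kind of objective CAV-MAE-style models build on. This is illustrative only; see the project page for the actual CAV-MAE Sync formulation and its fine-grained alignment.

```python
# Generic symmetric audio-visual InfoNCE loss (illustrative sketch).
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) embeddings of paired clips.
    Matching pairs (the diagonal) are pulled together, mismatched pairs pushed apart."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)       # audio -> video direction
    loss_v2a = F.cross_entropy(logits.t(), targets)   # video -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)

# Toy usage with random embeddings.
loss = av_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```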
Granite-speech audio LLM from IBM. The level of data detail here is great, especially compared to, e.g., the Whisper paper
``Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,'' George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mittal, Brian Kingsbury, David Haws, Edmilson Morais, Gakuto Kurata, Ha… ift.tt/QPsxkH2
``CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment,'' Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, ift.tt/USw7Px0
"Has there been any case of theft on the world's largest barbecue?" 😆
Have you enjoyed talking to 🟢Moshi? Have you dreamt of making your own speech-to-speech chat experience 🧑‍🔬🤖? It's now possible with the moshi-finetune codebase! Plug in your own dataset and change the voice, the tone, and the personality of Moshi 💚🔌💿. Here's an example after…
I'm curious about how OpenAI used RL for training their ASR models ("This methodology dramatically improves precision and reduces hallucination, making our speech-to-text solutions exceptionally competitive in complex speech recognition scenarios.") openai.com/index/introduc…
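OpenAI hasn't published the details, so purely as an illustration of one common way to frame ASR as an RL problem: reward sampled transcripts by negative word error rate, so that hallucinated extra words (insertions) are directly penalized. This is NOT their method, just a sketch of the concept.

```python
# Purely illustrative: a WER-based reward for RL-style ASR fine-tuning.
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def asr_reward(ref: str, hyp: str) -> float:
    """Higher is better; hallucinated extra words raise WER via insertions."""
    return -wer(ref, hyp)

print(asr_reward("turn the lights off", "turn the lights off please please"))  # penalized
```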
I'll present a dive into Moshi 🟢 and our translation model Hibiki 🇫🇷♻️🇬🇧 in the next @convAI2024 reading group 👨‍🏫📗. 📅 13/03 🕰️ 11am ET, 4pm in Paris. I'll discuss Mimi 🗜️ and multistream audio modeling 🔊. Join on Zoom, replay on YT. ⬛⬛🟧🟧🟨🟨🟩🟩🟩⬛ ⬛🟧🟧🟨🟨🟩🟩🟩⬛⬛
📢 Join our Conversational AI Reading Group! 📅 Thursday, March 13 | 11 AM - 12 PM EST 🎙Speaker: Alexandre Defossez @honualx 📖 Topic: "Moshi: a speech-text foundation model for real-time dialogue" 🔗 Details: (poonehmousavi.github.io/rg)
Looking for 1 intern on audio-visual generation (potentially video2audio generation)! We have the largest computational resources in Japan, and we do serious industrial research (and development). DM if interested; you can find out more about me on my homepage.
Follow our initiative to boost Ukrainian speech technologies! huggingface.co/speech-uk