Hugo
@Mldhug
PhD student in multimodal learning for audio understanding at @telecomparis. Intern @AIatMeta
You want to give audio abilities to your VLM without compromising its vision performance? You want to align your audio encoder with a pretrained image encoder without suffering from the modality gap? Check out our #NeurIPS2024 paper with @michelolzam @Steph_lat and Slim Essid
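A minimal sketch of the general idea (not the paper's implementation): train an audio encoder against a frozen image encoder with a pairwise contrastive term plus a distribution-matching term intended to reduce the modality gap. The encoders, embedding dimensions, and the moment-matching loss below are illustrative assumptions.

```python
# Sketch: align a trainable audio encoder to a frozen image encoder.
# The toy tensors stand in for audio_encoder(wav) and image_encoder(frame).
import torch
import torch.nn.functional as F

def distribution_alignment_loss(audio_emb, image_emb):
    """Match the first two moments (mean / covariance) of the two embedding
    distributions -- a simple proxy for closing the modality gap."""
    mu_loss = F.mse_loss(audio_emb.mean(0), image_emb.mean(0))
    cov_loss = F.mse_loss(torch.cov(audio_emb.T), torch.cov(image_emb.T))
    return mu_loss + cov_loss

def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Standard InfoNCE over audio/image pairs taken from the same video clip."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy batch of paired embeddings (hypothetical dimensions)
audio_emb = torch.randn(32, 512, requires_grad=True)   # trainable audio branch
image_emb = torch.randn(32, 512)                        # frozen image branch

loss = contrastive_loss(audio_emb, image_emb) + distribution_alignment_loss(audio_emb, image_emb)
loss.backward()
```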

We are looking for audio and speech generation people in Zurich, Paris or London to join our team at Google DeepMind. We build cutting-edge speech, music and audio (also audio-visual) generation capabilities. Reach out to Jason or me if interested. Retweets very much appreciated!
Our incredible team built many models announced here, including image, voice, music and video generation! And: I'm moving to London this summer, and I'm hiring for research scientist and engineering roles! Our focus is on speech & music in Zurich, Paris & London. DM/email me.
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
If you want to learn more about audio-visual alignment and how to use it to give audio abilities to your VLM, stop by our @NeurIPSConf poster #3602 (East exhibit hall A-C) tomorrow at 11am!
Great talk today by @LiuHaohe at the @tp_adasp group on Latent Diffusion Models (LDMs) as a versatile audio decoder! He walked us through diffusion basics, AudioLDM for text-to-audio, audio quality enhancement, and neural codecs!
"An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment," Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid, ift.tt/lf5BrIC
Given a number of ASR models of different sizes, how can I allocate an audio sample to the smallest one that will be good enough? @Mldhug worked on this question during his internship, and ended up with interesting conclusions you will find in our paper!
The preprint of our work (with @salah_zaiem and @AlgayresR) on sample-dependent ASR model selection is available on arXiv! In this paper we propose training a decision module that, given an audio sample, selects the smallest model sufficient to produce a good transcription.
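A hedged sketch of that idea, not the paper's code: a small decision module scores each candidate ASR model from cheap features of the utterance, and a router picks the cheapest model predicted to be good enough. Model names, the feature dimensionality, and the threshold are assumptions made for illustration.

```python
# Sketch: sample-dependent routing to the smallest sufficient ASR model.
import torch
import torch.nn as nn

ASR_MODELS = ["tiny", "base", "small", "large"]  # hypothetical list, cheapest first

class DecisionModule(nn.Module):
    """Predicts, for each candidate ASR model, whether it would be sufficient."""
    def __init__(self, feat_dim=40, n_models=len(ASR_MODELS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_models))

    def forward(self, features):
        return torch.sigmoid(self.net(features))  # P(model yields a good transcription)

def route(features, decision_module, threshold=0.5):
    """Return the cheapest model whose predicted success probability clears the threshold."""
    probs = decision_module(features)
    for idx, name in enumerate(ASR_MODELS):
        if probs[idx] >= threshold:
            return name
    return ASR_MODELS[-1]  # fall back to the largest model

decision_module = DecisionModule()
features = torch.randn(40)  # e.g., averaged filterbank features of one utterance
print(route(features, decision_module))
```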
It's really surprising how far one can go with *linear* predictors in the autoregressive setting. Interesting theory and experiments on TinyStories: a linear model (with 162M params :-) ) can generate totally coherent text with few grammatical mistakes. arxiv.org/abs/2309.06979
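To make "*linear* predictor in the autoregressive setting" concrete, here is an illustrative sketch (not the paper's exact setup): next-token logits computed as a single linear map of the one-hot-encoded context window. The vocabulary size and context length below are toy values.

```python
# Sketch: a purely linear autoregressive next-token predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, k = 512, 8                  # toy vocabulary size and context length
linear = nn.Linear(k * V, V)   # the entire "language model" is one matrix (+ bias)

def next_token_logits(context_ids):
    """context_ids: LongTensor of shape (k,) holding the last k token ids."""
    one_hot = F.one_hot(context_ids, num_classes=V).float().flatten()
    return linear(one_hot)

context = torch.randint(0, V, (k,))
probs = F.softmax(next_token_logits(context), dim=-1)
next_id = torch.multinomial(probs, 1)   # sample the next token
print(next_id.item())
```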
Why is Whisper so robust to background noise? Not because Whisper suppresses it, but because Whisper 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐬 it! Check out the amazing work by Yuan Gong @YGongND. They reveal this emergent capability of Whisper, and get SOTA *simultaneous* ASR + audio tagging
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language. Looks promising; I'll have to try and see if it stands up to some poking ;-) Love that they get around the need for multimodal training. ar5iv.org/abs/2306.16410 github.com/ContextualAI/l…
DreamDiffusion: Generating High-Quality Images from Brain EEG Signals paper page: huggingface.co/papers/2306.16… The paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate…
Super excited to finally launch Voicebox🤩, the most versatile speech generative model ever💬👄 Demo page: voicebox.metademolab.com
Introducing Voicebox, a new breakthrough generative speech system based on Flow Matching, a new method proposed by Meta AI. It can synthesize speech across six languages, perform noise removal, edit content, transfer audio style & more. More details on this work & examples ⬇️
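For context on flow matching itself (the general technique, not Meta's Voicebox implementation), here is a minimal training-step sketch; the tiny MLP and the 80-dim "speech features" are stand-ins for the real model and data.

```python
# Sketch: one conditional flow-matching training step (optimal-transport path).
import torch

model = torch.nn.Sequential(torch.nn.Linear(81, 256), torch.nn.ReLU(), torch.nn.Linear(256, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def flow_matching_step(x1):
    """Interpolate between noise x0 and data x1, then regress the constant
    target velocity (x1 - x0) along the straight-line path."""
    x0 = torch.randn_like(x1)                   # noise sample
    t = torch.rand(x1.size(0), 1)               # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # point on the interpolation path
    target_v = x1 - x0                          # velocity of the path
    pred_v = model(torch.cat([xt, t], dim=-1))  # condition on time by concatenation
    loss = torch.nn.functional.mse_loss(pred_v, target_v)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x1 = torch.randn(16, 80)  # toy batch of "speech feature" frames
print(flow_matching_step(x1))
```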
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim with the below Fig2 in CLIP. Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours. Generative captioning is not only competitive, it seems better!
Thanks for tweeting, @AK! We’re super excited about the future of text-only vision model selection! 🙏 @MarsScHuang @kcjacksonwang @syeung10
LOVM: Language-Only Vision Model Selection paper page: huggingface.co/papers/2306.08… Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot…
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration paper page: huggingface.co/papers/2306.09… Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data…
We'll present GeneCIS at #CVPR2023 (Highlight) TL;DR: While most image representations are *fixed*, we present a general way to train and evaluate models which can adapt to different *conditions* on the fly. Code: github.com/facebookresear… Project page: sgvaze.github.io/genecis/ 🧵
GeneCIS: A Benchmark for General Conditional Image Similarity paper page: huggingface.co/papers/2306.07… The paper argues that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning…