Abraham Owodunni
@AbrahamOwos
Exploring multilingual NLP & Speech. PhDing @osunlp. Organizer and member @masakhanenlp, @mrl2024_emnlp, ex: Researcher @Intronhealth
ML Collective is bridging the gap in AI by making it possible for these young researchers to attend @DeepIndaba ’s annual conference this year. Please consider supporting these undergrads by donating to this cause. They've all shared their stories in the video from this tweet ❤️.
The opportunity gap in AI is more striking than ever. We talk way too much about those receiving $100M or whatever for their jobs, but not enough about those asking for <$1k to present their work. For the 3rd year in a row, @ml_collective is raising funds to support @DeepIndaba attendees.
It's interesting to see that memorization is still a huge problem to be addressed.
Prompting Llama 3.1 70B with “Mr and Mrs. D” can seed the generation of a near-exact copy of the entire ~300 page book ‘Harry Potter & the Sorcerer’s Stone’ 🤯 We define a “near-copy” as text that is identical modulo minor spelling / punctuation variations. Below…
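A minimal sketch of this kind of memorization probe (not the authors' code): prompt the model with the book's opening words, decode greedily, and check whether the continuation is a "near-copy" of the source after normalizing minor spelling and punctuation differences. The checkpoint name, decoding settings, and the normalization-based comparison are illustrative assumptions.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B"  # assumed checkpoint name

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def normalize(text: str) -> str:
    # Drop punctuation and collapse whitespace so minor variations don't count.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def is_near_copy(generated: str, reference: str) -> bool:
    # "Near-copy": identical modulo minor spelling / punctuation variations.
    return normalize(generated) == normalize(reference)

prompt = "Mr. and Mrs. D"  # opening words used as the seed
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)

# Compare against the corresponding passage from the source text (not included here):
# print(is_near_copy(continuation, reference_passage))
```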
Don’t forget to stop by for the second round of our presentation at the workshop today.
We’re thrilled to share our latest work: FLEXITOKENS! In this work, we introduce language models with learnable tokenizers that make tokenization truly flexible during adaptation. See example below ↓ 1/n
I'm attending ICML2025🇨🇦 this week. I'd love to connect and chat about representation learning, tokenization and knowledge distillation from a multilingual perspective. Also, let me know if you think there's someone I should be talking to at the conference!
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (avg. length: 51 minutes).
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning, which will take place in Suzhou, China, co-located with EMNLP 2025! See details below!