Raphaël Sourty
@raphaelsrty
Language Models, Knowledge Bases, Knowledge Distillation PhD | AI @LightonIO
I'm thrilled to announce the release of FastPlaid! 🚀🚀 FastPlaid is a high-performance engine for multi-vector search, built from the ground up in Rust (with the help of Torch C++) ⚡️ You can view FastPlaid as the counterpart of Faiss for multi-vector search.
🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…
Great excuse to share something I really love: 1-Lipschitz nets. They give clean theory, certs for robustness, the right loss for W-GANs, even nicer grads for explainability!! Yet they're still niche. Here’s a speed-run through some of my favorite papers in the field. 🧵👇
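For anyone who wants the "1-Lipschitz" part made concrete before diving into the papers, here is a minimal sketch (my own illustration, not taken from the thread): constraining each linear layer with PyTorch's spectral_norm parametrization bounds the network's Lipschitz constant by the product of the per-layer spectral norms, which is the quantity the robustness certificates build on.

```python
# Hedged sketch: an (approximately) 1-Lipschitz MLP via spectral normalization.
# Each normalized linear layer has spectral norm ~1 and ReLU is 1-Lipschitz,
# so the composition cannot move its output further than its input moved.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

net = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),  # ||W||_2 ≈ 1 after normalization
    nn.ReLU(),                           # 1-Lipschitz activation
    spectral_norm(nn.Linear(256, 10)),
)

x = torch.randn(1, 784)
delta = 0.1 * torch.randn(1, 784)
with torch.no_grad():
    ratio = torch.norm(net(x + delta) - net(x)) / torch.norm(delta)
print(f"output / input displacement: {ratio.item():.3f} (expected ≲ 1)")
```

The product-of-spectral-norms bound is loose in practice, which is one reason dedicated 1-Lipschitz architectures exist.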
optimization theorem: "assume a lipschitz constant L..." the lipschitz constant:
🚀 Introducing Qwen3-MT – our most powerful translation model yet! Trained on trillions of multilingual tokens, it supports 92+ languages—covering 95%+ of the world’s population. 🌍✨ 🔑 Why Qwen3-MT? ✅ Top-tier translation quality ✅ Customizable: terminology control, domain…
🚀 Introducing GLiClass‑V3 – a leap forward in zero-shot classification! Matches or beats cross-encoder accuracy, while being up to 50× faster. Real-time inference is now possible on edge hardware. huggingface.co/collections/kn… #TextClassification #NLP #ZeroShot #GLiClass
Ok the solution might be way easier than expected ... it's been a long time since we last released a SOTA model, hasn't it? 😇
I am starting to be more and more convinced that MaxSim generalizes very well to long documents but struggles on longer queries, most probably due to the asymmetry. Larger documents are bounded by the number of query tokens, but larger queries might get noisy. Either it is a query…
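A tiny numpy sketch of the standard MaxSim / late-interaction score (my own illustration, not code from any particular library) makes that asymmetry visible: with unit-normalized token embeddings each query token contributes at most 1, so the score is bounded by the number of query tokens no matter how long the document is, while every extra query token adds another (possibly noisy) max term to the sum.

```python
# Hedged sketch of MaxSim: each query token keeps its best-matching document
# token, and those per-token maxima are summed over the query.
import numpy as np

def maxsim(Q, D):
    """Q: (num_query_tokens, dim), D: (num_doc_tokens, dim), rows unit-normalized."""
    sims = Q @ D.T                 # token-to-token cosine similarities
    return sims.max(axis=1).sum()  # best doc token per query token, then sum

rng = np.random.default_rng(0)
unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)

Q = unit(rng.normal(size=(32, 128)))          # 32 query tokens
D_short = unit(rng.normal(size=(200, 128)))   # short document
D_long = unit(rng.normal(size=(4000, 128)))   # much longer document

# Both scores stay below |Q| = 32: document length only affects the max,
# while adding query tokens adds new terms to the sum.
print(maxsim(Q, D_short), maxsim(Q, D_long), "<=", len(Q))
```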
Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing…
Some of the ModernBERT team is back with new encoder models: Ettin, ranging from tiny to large: 17M, 32M, 68M, 150M, 400M & 1B parameters. They also trained decoder models & checked if decoders could classify & if encoders could generate. Details in 🧵:
Introducing ColQwen-Omni, a 3B omnimodal retriever that extends the ColPali concept of multimodal retrieval with late interaction to audio chunks and short videos, with no performance degradation on visual document retrieval wrt our best models! (1/N)
TorchDR 0.3 is here with some major improvements, taking the library to the next level! TorchDR leverages vectorized implementations on GPU for super fast dimensionality reduction. Thanks to all the contributors!! Description below 🧵
✊ Transformers... Assemble! Introducing ♊Ettin Suite, a SoTA open recipe to outperform existing Generative & Retrieval Models. Developed by @JohnsHopkins in collaboration with @LightOnIO, Ettin is the first-ever SoTA suite of paired encoder & decoder models. The revolution…
To anyone wondering what the difference is between encoders and decoders on downstream tasks when both models are trained the same way, this blog post is for you. Very interesting resource and new models available, impressive work 🙌
Should we just focus our pre-training efforts on decoders? To answer this, we trained Ettin, a suite of identically trained encoders and decoders ranging from 17M to 1B parameters on 2T tokens of open data (beating Llama 3.2 and ModernBERT in the process)!
🤔 Have you ever wondered how good ModernBERT is compared to decoders like Llama? We made an open-data version of ModernBERT and used the same recipe for encoders and decoders. Turns out, our encoder model beats ModernBERT and our decoder model beats Llama 3.2 / SmolLM2 🤯 🧵
The #SIGIR2025 Best Paper just awarded to the WARP engine for fast late interaction! Congrats to Luca Scheerer🎉 WARP was his @ETH_en MS thesis, completed while visiting us at @StanfordNLP. Incidentally, it's the fifth Paper Award for a ColBERT paper since 2020!* Luca did an…
📢 If you’re at #SIGIR2025 this week, make sure to be at Luca Scheerer’s paper talk: “WARP: An Efficient Engine for Multi-Vector Retrieval” (Wednesday 11am) WARP makes PLAID, the famous ludicrously fast ColBERT engine, another 3x faster on CPUs. With the usual ColBERT quality!
Alright, that’s it: WARP coming to PyLate soon™️
This was a really enjoyable and approachable blog post. I think this is my favorite explanation of MaxSim and I'm going to use it moving forward. But this snippet doesn't do the whole post justice---read the whole thing!
If you've made it this far down the thread, you might want a link reminder, so here you are: Github: github.com/mixedbread-ai/… Blog: mixedbread.com/blog/maxsim-cpu
Awesome new lib from @bclavie to make MaxSim computation quicker on CPU! In the same vein, also worth highlighting pylate-rs by @raphaelsrty! Both libs go a long way toward making Late Interaction (ColPali, ColBERT) ever more accessible!
New blog post & new library are out now! The blog post is about MaxSim, why it's *orders of magnitude* more demanding than normal cosine similarity, and why GPUs don't care, but CPUs do! The library is maxsim-cpu, which makes it so CPUs can be fast and play it cool, too.
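To give a rough sense of that "orders of magnitude" claim (my own back-of-the-envelope numbers, not taken from the blog post): a single-vector retriever scores a document with one dot product in the embedding dimension, while MaxSim needs the full query-tokens × document-tokens similarity matrix before the max/sum reduction.

```python
# Hedged FLOP estimate with illustrative sizes: why MaxSim is far heavier per
# document than plain cosine similarity between two pooled vectors.
dim, n_query_tokens, n_doc_tokens = 128, 32, 300

cosine_flops = 2 * dim                                  # one d-dimensional dot product
maxsim_flops = 2 * dim * n_query_tokens * n_doc_tokens  # |Q| x |D| dot products

print(f"cosine : {cosine_flops:>12,} FLOPs per document")
print(f"MaxSim : {maxsim_flops:>12,} FLOPs per document "
      f"({maxsim_flops // cosine_flops:,}x more)")
```

GPUs hide that extra work behind one big batched matmul; on CPUs it is exactly the cost that libraries like maxsim-cpu and pylate-rs are working to bring down.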