Nathan Godey
@nthngdy
Working on the representations of LMs and pretraining methods @Inria Paris https://nathangodey.github.io/
🚀 New Paper Alert! 🚀 We introduce Q-Filters, a training-free method for efficient KV Cache compression! It is compatible with FlashAttention and can compress the cache during generation, which is particularly useful for reasoning models ⚡ ⬇️R1-Distill-Llama-8B with 128 KV pairs ⬇️ 🧵
🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 huggingface.co/spaces/nvidia/…
We produced FineWeb-Edu style annotations for biomedical data and showed that it helps for continued pre-training and lets us target domains to improve on! Work led by the amazing @riantouchent and supervised by @DeVillemonte 🌟 Check out the thread and paper below 👇🏼
Excited to introduce 𝗕𝗶𝗼𝗺𝗲𝗱-𝗘𝗻𝗿𝗶𝗰𝗵𝗲𝗱 🎉, a new annotated biomedical dataset designed to tackle the scarcity of clinical data for NLP research! 133M paragraphs from PMC-OA annotated for type, domain, and educational quality and publicly available on @huggingface👇🧵
ModernBERT or DeBERTaV3? What's driving performance: architecture or data? To find out, we pretrained ModernBERT on the same dataset as CamemBERTaV2 (a DeBERTaV3 model) to isolate architecture effects. Here are our findings:
I'm looking for 2 emergency reviewers for ACL 2025 in the Language Modeling and Efficient Methods for NLP tracks. Please reach out in my DMs if you are interested and can do a review within 24 hours 😬
*Q-Filters: Leveraging QK Geometry for KV Cache Compression* by @nthngdy @devoto_alessio @yuzhaouoe @PMinervini @bensagot We find directions in the KV cache geometry allowing us to compress the cache significantly with little degradation in performance. arxiv.org/abs/2503.02812
🎉 Excited to share “Generalizing from Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning” 📄 (arxiv.org/pdf/2502.15592) We propose "context synthesis": instead of generating instructions from long texts, we synthesize contexts for instructions—drawing…
We find that a single biased direction encodes a KV Cache selection mechanism in Self-Attention: a Key vector with a strong component along this direction results in its Key-Value pair being ignored by the Queries 🚀🚀🚀
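A minimal sketch of the idea behind that observation, not the paper's exact implementation: we estimate a dominant query direction per head offline (here via SVD of sampled query vectors, with a sign convention chosen so queries project positively), then score cached Keys by their projection onto that direction and keep only the highest-scoring KV pairs. Function names, the SVD-based estimation, and the top-k selection are illustrative assumptions; the actual Q-Filters estimation, sign handling, and eviction policy may differ.

```python
import torch

def estimate_q_filter(queries: torch.Tensor) -> torch.Tensor:
    # queries: (num_samples, head_dim) query vectors gathered offline for one head.
    # Use the principal right singular vector as the dominant query direction
    # (assumption: a single direction captures most of the query anisotropy).
    _, _, vh = torch.linalg.svd(queries, full_matrices=False)
    direction = vh[0]
    # Orient the filter so that queries project positively onto it on average.
    if (queries @ direction).mean() < 0:
        direction = -direction
    return direction  # unit-norm filter for this head

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                q_filter: torch.Tensor, budget: int):
    # keys / values: (seq_len, head_dim). Keep the `budget` KV pairs whose Keys
    # project most strongly onto the query direction, used here as a cheap
    # proxy for the attention mass they would receive.
    scores = keys @ q_filter
    keep = scores.topk(min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep]
```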
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression