TimDarcet
@TimDarcet
PhD student, building big vision models @ INRIA & FAIR (Meta)
1/ This week we released DINOv2: a series of general vision encoders pretrained without supervision. Good out-of-the-box performance on a variety of domains, matching or surpassing other publicly available encoders.
Why does Meta open-source its models? I talked about it with @kawecki_maciej looking at Dino, our computer vision model with applications in forest mapping, medical research, agriculture and more. Open-source boosts AI access, transparency, and safety. youtube.com/watch?v=eNGafi…
~400 people joined us on Sunday at the @Cohere_Labs Open Science Community ML Summer School. @TimDarcet, as always, delivered a super amazing talk on Scaling Self-Supervised Learning (SSL, DINOv2, Masked Image Modeling, CAPI). Super interesting session.
Hey I'm a doctor now, neat
🚨New doctor in the house!🚨 Congrats to @TimDarcet for his tremendous work (DINOv2, registers, CAPI) & successful PhD defense followed by ~2 hrs of questions -- he's got stamina! Congrats to his incredible team of advisors from Inria & Meta: @julienmairal @p_bojanowski M. Oquab
I already know who Jiahui is you don’t have to tell me
On the left is Ronaldo; Real Madrid spent $80M to sign him from Man United. On the right is Jiahui Yu; Meta paid $100M to sign him from OpenAI.
In case there is any ambiguity: DINOv2 is 100% a product of dumb hill-climbing on ImageNet-1k kNN accuracy (and linear too). Overfitting an eval can be bad. But sometimes the reward signal is reliable, and leads to truly good models. It's about finding a balance.
Oh I am a big fan of self-supervised learning. Also SSL has never been benchmark-maxing on ImageNet afaik. I am mainly complaining about the supervised classification ImageNet hill climb.
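For anyone wondering what that hill-climbing signal actually is, here is a minimal sketch of kNN evaluation on frozen features. This is the generic recipe, not the exact DINOv2 eval code; all names are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Similarity-weighted kNN classification on frozen encoder features.
    Generic sketch of the usual protocol, not the exact DINOv2 eval code."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    num_classes = int(train_labels.max()) + 1
    sims = test_feats @ train_feats.T                       # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]                    # (num_test, k)
    votes = torch.zeros(len(test_feats), num_classes, device=sims.device)
    votes.scatter_add_(1, topk_labels, topk_sims)           # similarity-weighted votes
    preds = votes.argmax(dim=1)
    return (preds == test_labels).float().mean().item()
```

The linear probe is the same idea with a logistic-regression head trained on the frozen features instead of the vote.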
FFS @huggingface please stop doing that it makes you look like pretentious assholes
HF stealing all generic (pypi) package names
Great summary of dino.txt by Fede! Drop by the poster if you're at CVPR! 📅 Sunday, June 15 🕥 10:30 - 12:30 📍 Poster 370
DINOv2 meets text at #CVPR 2025! Why choose between high-quality DINO features and CLIP-style vision-language alignment? Pick both with dino.txt 🦖📖 We align frozen DINOv2 features with text captions, obtaining both image-level and patch-level alignment at a minimal cost. [1/N]
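Not the dino.txt recipe itself, just to make the setup concrete: a generic CLIP-style alignment sketch where the vision backbone stays frozen and only small projections (plus the text side) are trained with a contrastive loss. All class and argument names here are placeholders I made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVisionTextAlign(nn.Module):
    """Generic sketch: align a frozen vision encoder with text via a CLIP-style
    contrastive loss, training only the projections and the text encoder.
    Illustrative placeholder, not the dino.txt implementation."""
    def __init__(self, vision_encoder, text_encoder, vis_dim, txt_dim, embed_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)                         # keep vision features frozen
        self.text_encoder = text_encoder
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))  # ~log(1/0.07), CLIP-style

    def forward(self, images, text_tokens):
        with torch.no_grad():
            v = self.vision_encoder(images)                 # (B, vis_dim) image-level features
        t = self.text_encoder(text_tokens)                  # (B, txt_dim)
        v = F.normalize(self.vis_proj(v), dim=-1)
        t = F.normalize(self.txt_proj(t), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T
        targets = torch.arange(len(images), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```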
Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵
Two things teach intellectual humility: people smarter than you, and maths. Doing math with people smarter than you is sort of a bit too much.
So the reason I was asking about this is that the squared L2 has the very pleasant property of reducing to "just push away from the avg", and that would eliminate all batch-size issues (you can use an EMA avg). It's basically what DINO does, w/ softmax+CE loss instead of L2.
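To spell out that reduction (my own two-line expansion, using the same notation as the loss in the question below): the repulsive term only sees the negatives through their mean, so an EMA of that mean is all you need.

```latex
R(x_i) \;=\; -\lambda \sum_{k=1}^{K} \lVert x_i - x'_k \rVert^2
\qquad\Longrightarrow\qquad
\nabla_{x_i} R \;=\; -2\lambda \sum_{k=1}^{K} \bigl(x_i - x'_k\bigr)
\;=\; -2\lambda K \bigl(x_i - \bar{x}'\bigr),
\quad \bar{x}' = \tfrac{1}{K}\sum_{k} x'_k
```

So the gradient-descent step contributed by this term is +2ηλK(x_i − x̄'): it just pushes x_i away from the mean embedding x̄'.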
Is there a good reason we use softmax losses in contrastive learning, instead of just doing MSE? i.e. L = ||x_i - x_i'||² - λ Σ_k ||x_i - x_k'||² I'd guess the optimization dynamics are maybe friendlier, but does anyone have a good pointer? Both for CLIP and SSL btw
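To make the comparison concrete, a tiny sketch of both options (illustrative only, not from any codebase): the plain-MSE loss as written above, next to a standard InfoNCE/softmax loss.

```python
import torch
import torch.nn.functional as F

def mse_contrastive(x, x_pos, lam=0.01):
    """The plain-MSE loss from the tweet: pull matched pairs together,
    push every x_i away from all x_k' (illustrative; the sum includes k = i)."""
    attract = ((x - x_pos) ** 2).sum(dim=1)                 # ||x_i - x_i'||^2
    repel = torch.cdist(x, x_pos).pow(2).sum(dim=1)         # sum_k ||x_i - x_k'||^2
    return (attract - lam * repel).mean()

def infonce(x, x_pos, temperature=0.1):
    """Standard softmax/cross-entropy contrastive loss (CLIP / SSL style)."""
    logits = F.normalize(x, dim=1) @ F.normalize(x_pos, dim=1).T / temperature
    targets = torch.arange(x.shape[0], device=x.device)
    return F.cross_entropy(logits, targets)
```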
Oh these plots are great too. They fit with my observations on norms of different things.
arxiv.org/abs/2410.10781… oh, it's interesting that attn sink is caused by angular difference rather than its scale.
Summary of "Massive activations in LLMs": - "artifact" tokens are in all transformers, ViTs and LLMs - their weirdness is ~only on 1 channel - they are the same as the quantization outliers - their purpose is *not* global information - there's a fix simpler than registers
Could you give a summary for all the lazy readers who won't open the link?
I also view layernorm as hyperplane proj + hypersphere proj. Hyperplane proj makes no sense, hence we do RMSNorm now. Although don't forget the epsilon: we project onto the hyper*ball* actually.
Absolutely gold article. Changed the way I see Layer Norm
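For anyone who wants to poke at the geometry, a tiny numpy sketch of that view (my own illustration): mean subtraction projects onto the hyperplane orthogonal to the all-ones vector, and dividing by sqrt(var + eps) lands on the sphere of radius sqrt(d) when eps = 0, and strictly inside it (the ball) when eps > 0.

```python
import numpy as np

d, eps = 16, 1e-5
x = np.random.randn(d)

# Step 1: mean subtraction = orthogonal projection onto the hyperplane {v : sum(v) = 0},
# i.e. removing the component along the (normalized) all-ones direction.
ones = np.ones(d) / np.sqrt(d)
x_centered = x - x.mean()
assert np.allclose(x_centered, x - (x @ ones) * ones)

# Step 2: dividing by sqrt(var + eps) lands inside the ball of radius sqrt(d);
# with eps = 0 it would land exactly on the sphere of radius sqrt(d).
y = x_centered / np.sqrt(x_centered.var() + eps)
print(np.linalg.norm(y), np.sqrt(d))   # norm <= sqrt(d), equal only when eps = 0

# RMSNorm skips the centering (the hyperplane projection) and only rescales.
y_rms = x / np.sqrt((x ** 2).mean() + eps)
```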
Ok there's a new paper in my top 3 favorites: "Vision Transformers Need Registers". Clear problem, elegant solution, well written, easy to understand, good results, limitations included. No fancy losses or layers. No equation (at all!) Here's a short summary: (1/4)
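A minimal sketch of the register idea as I understand it, for readers who want it in code: a few extra learnable tokens appended to the input sequence, used as attention scratch space and discarded at the output. Illustrative only; `backbone` stands in for any ViT block stack.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch of register tokens: extra learnable tokens appended to the token
    sequence, processed like any other token, then dropped from the output."""
    def __init__(self, backbone, dim, num_registers=4):
        super().__init__()
        self.backbone = backbone
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers

    def forward(self, tokens):                 # tokens: (B, N, dim) = [CLS] + patch tokens
        B = tokens.shape[0]
        reg = self.registers.expand(B, -1, -1)
        out = self.backbone(torch.cat([tokens, reg], dim=1))
        return out[:, :-self.num_registers]    # discard the register tokens at the output
```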