Fenil Doshi
@fenildoshi009
PhD student @Harvard and @KempnerInst studying biological and machine vision | object perception | mid-level vision | cortical organization
🧵 What if two images have the same local parts but represent different global shapes purely through part arrangement? Humans can spot the difference instantly! The question is: can vision models do the same? 1/15
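A minimal sketch of the stimulus idea (my own illustration, not the paper's generation code): the same set of local patches tiled into two different global arrangements, so pixel content matches while configuration differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four distinct "local parts": 16x16 patches with different random textures.
parts = [rng.random((16, 16)) for _ in range(4)]

def assemble(order):
    """Tile the same four parts into a 2x2 global arrangement."""
    top = np.concatenate([parts[order[0]], parts[order[1]]], axis=1)
    bottom = np.concatenate([parts[order[2]], parts[order[3]]], axis=1)
    return np.concatenate([top, bottom], axis=0)

# Identical local content, different global configuration.
img_a = assemble([0, 1, 2, 3])
img_b = assemble([2, 0, 3, 1])

# Same multiset of pixels (same parts), different spatial layout.
assert np.allclose(np.sort(img_a.ravel()), np.sort(img_b.ravel()))
print(img_a.shape, img_b.shape)  # (32, 32) each
```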
Quick thread on the recent IMO results and the relationship between symbol manipulation, reasoning, and intelligence in machines and humans:
🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n
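For contrast, a toy sketch of the two objectives being compared (stand-in linear models and made-up shapes, not the paper's setup): autoregressive next-token cross-entropy versus the diffusion denoising loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, T = 100, 32, 16

# Toy stand-ins for the two model families (illustrative only).
ar_model = torch.nn.Linear(dim, vocab)   # predicts the next token from a hidden state
denoiser = torch.nn.Linear(dim, dim)     # predicts the noise added to a latent

# Autoregressive objective: next-token cross-entropy.
hidden = torch.randn(T, dim)             # hidden states for positions 0..T-1
tokens = torch.randint(0, vocab, (T,))   # ground-truth next tokens
ar_loss = F.cross_entropy(ar_model(hidden), tokens)

# Diffusion objective: predict the injected noise at a random noise level.
x0 = torch.randn(T, dim)                 # clean latents
noise = torch.randn_like(x0)
alpha = torch.rand(T, 1)                 # per-sample noise level in (0, 1)
x_noisy = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
diff_loss = F.mse_loss(denoiser(x_noisy), noise)

print(float(ar_loss), float(diff_loss))
```

One intuition consistent with the data-constrained claim (the paper's own argument may be more precise): the diffusion loss re-noises each example afresh at a new noise level every time it is revisited, so repeated epochs over a fixed dataset keep supplying new training signal.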
Why do video models handle motion so poorly? It might be lack of motion equivariance. Very excited to introduce: Flow Equivariant RNNs (FERNNs), the first sequence models to respect symmetries over time. Paper: arxiv.org/abs/2507.14793 Blog: kempnerinstitute.harvard.edu/research/deepe… 1/🧵
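A toy illustration of the equivariance property in question (my sketch, not the FERNN architecture): a per-frame circular convolution commutes trivially with a constant-velocity flow; the paper's contribution is getting the analogous commutation for recurrent models that carry temporal state.

```python
import torch

torch.manual_seed(0)

def roll_flow(video, velocity):
    """Apply a constant-velocity horizontal flow: frame t is shifted by velocity * t pixels."""
    return torch.stack([torch.roll(frame, shifts=velocity * t, dims=-1)
                        for t, frame in enumerate(video)])

# Toy "sequence model": a per-frame circular convolution (shift-equivariant per frame).
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

video = torch.randn(8, 1, 16, 16)  # (time, channels, H, W)
out_then_flow = roll_flow(conv(video), velocity=1)
flow_then_out = conv(roll_flow(video, velocity=1))

# Flow equivariance: processing then flowing == flowing then processing.
print(torch.allclose(out_then_flow, flow_then_out, atol=1e-5))
```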
New in the #DeeperLearningBlog: #KempnerInstitute research fellow @t_andy_keller introduces the first flow equivariant neural networks, which reflect motion symmetries, greatly enhancing generalization and sequence modeling. bit.ly/451fQ48 #AI #NeuroAI
Great excuse to share something I really love: 1-Lipschitz nets. They give clean theory, certs for robustness, the right loss for W-GANs, even nicer grads for explainability!! Yet they’re still niche. Here’s a speed-run through some of my favorite papers in the field. 🧵👇
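A minimal sketch of one standard recipe, not taken from any single paper in the thread: spectral-normalized layers give an (approximately) 1-Lipschitz network, and the logit margin then yields a certified l2 radius of margin / sqrt(2).

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

torch.manual_seed(0)

# Each linear layer is constrained to spectral norm <= 1 (approximately, via
# power iteration); with 1-Lipschitz activations such as ReLU, the whole
# network is then 1-Lipschitz in l2.
net = nn.Sequential(
    spectral_norm(nn.Linear(32, 64)), nn.ReLU(),
    spectral_norm(nn.Linear(64, 10)),
)

x = torch.randn(1, 32)
logits = net(x)[0]
top2 = logits.topk(2).values
margin = (top2[0] - top2[1]).item()

# A standard margin-based certificate for an L-Lipschitz net: no l2
# perturbation smaller than margin / (sqrt(2) * L) can flip the top-1
# prediction (here L = 1).
radius = margin / (2 ** 0.5)
print(f"certified l2 radius ≈ {radius:.3f}")
```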
optimization theorem: "assume a lipschitz constant L..." the lipschitz constant:
Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress - huge congrats to @lmthang and the team! deepmind.google/discover/blog/…
Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with a ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, DINOv2 on various benchmarks, setting a new standard for open-source research🧵
We prompt a generative video model to extract state-of-the-art optical flow, using zero labels and no fine-tuning. Our method, KL-tracing, achieves SOTA results on TAP-Vid & generalizes to challenging YouTube clips. @khai_loong_aw @KlemenKotar @CristbalEyzagu2 @lee_wanhee_…
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
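A toy sketch of the kind of intervention at stake (additive steering along a probe direction; all names here are hypothetical): the edit only isolates the target feature if that feature really is encoded as a linear direction, which is exactly the implicit assumption being pointed out.

```python
import torch

torch.manual_seed(0)
d = 64

# Toy activations and a candidate "feature direction" found by, e.g., a linear probe.
acts = torch.randn(8, d)          # activations for 8 inputs
direction = torch.randn(d)
direction = direction / direction.norm()

def intervene(acts, direction, alpha):
    """Additive steering: shift activations along a single direction.

    This only manipulates the hypothesized feature cleanly if the feature
    really is a linear direction -- the assumption the paper highlights.
    """
    return acts + alpha * direction

steered = intervene(acts, direction, alpha=3.0)
# The projection onto the direction moves by exactly alpha; everything
# orthogonal is untouched -- by construction, not by discovery.
print(steered @ direction - acts @ direction)  # ≈ 3.0 everywhere
```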
Submit to our workshop on contextualizing CogSci approaches for understanding neural networks: "Cognitive Interpretability"!
We’re excited to announce the first workshop on CogInterp: Interpreting Cognition in Deep Learning Models @ NeurIPS 2025! 📣 How can we interpret the algorithms and representations underlying complex behavior in deep learning models? 🌐 coginterp.github.io/neurips2025/ 1/
🧠 Submit to CogInterp @ NeurIPS 2025! Bridging AI & cognitive science to understand how models think, reason & represent. CFP + details 👉 coginterp.github.io/neurips2025/
📢 New preprint! Do LVLMs have strong visual perception capabilities? Not quite yet... We introduce VisOnlyQA, a new dataset for evaluating the visual perception of LVLMs, and find that existing LVLMs perform poorly on it. [1/n] arxiv.org/abs/2412.00947 github.com/psunlpgroup/Vi…
Impressed by DINOv2 perf. but don't want to spend too much $$$ on compute and wait for days to pretrain on your own data? Say no more! Data augmentation curriculum speeds up SSL pretraining (as it did for generative and supervised learning) -> FastDINOv2! arxiv.org/abs/2507.03779
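A generic sketch of what an augmentation curriculum can look like, with made-up schedule constants; the actual FastDINOv2 recipe is in the paper.

```python
import torchvision.transforms as T

def curriculum_augmentation(step, total_steps):
    """Ramp augmentation strength from weak to strong over training.

    A generic curriculum sketch: early steps see mild crops and no color
    jitter, later steps see the full augmentation stack. The real
    FastDINOv2 schedule may differ -- see the paper for the actual recipe.
    """
    p = step / total_steps
    min_scale = 0.9 - 0.6 * p      # crops get more aggressive over time
    jitter_strength = 0.4 * p      # color jitter fades in
    return T.Compose([
        T.RandomResizedCrop(224, scale=(min_scale, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(jitter_strength, jitter_strength, jitter_strength, jitter_strength / 4),
        T.ToTensor(),
    ])

early = curriculum_augmentation(step=0, total_steps=100_000)
late = curriculum_augmentation(step=90_000, total_steps=100_000)
```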
Exciting new preprint from the lab: “Adopting a human developmental visual diet yields robust, shape-based AI vision”. A most wonderful case where brain inspiration massively improved AI solutions. Work with @lu_zejin @martisamuser and Radoslaw Cichy arxiv.org/abs/2507.03168
Coming up at ICML: 🤯Distribution shifts are still a huge challenge in ML. There's already a ton of algorithms to address specific conditions. So what if the challenge was just selecting the right algorithm for the right conditions?🤔🧵
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
Updated paper! Our main new finding: by creating attention biases at test time—without extra tokens—we remove high-norm outliers and attention sinks in ViTs, while preserving zero-shot ImageNet performance. Maybe ViTs don’t need registers after all? x.com/nickhjiang/sta…
Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵
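A generic sketch of the attention-sink idea behind test-time biases: an extra logit per query that can absorb attention mass without contributing any value, so queries need not dump mass onto arbitrary patch tokens. The paper's actual procedure differs in detail.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, dim = 10, 16

q = torch.randn(n_tokens, dim)
k = torch.randn(n_tokens, dim)
v = torch.randn(n_tokens, dim)

def attention_with_sink(q, k, v, sink_bias=0.0):
    """Self-attention with one extra 'sink' logit per query.

    The sink contributes no value; it only absorbs attention mass. A
    generic sketch of a test-time attention bias, not the paper's exact
    method.
    """
    logits = q @ k.T / dim ** 0.5                  # (n_tokens, n_tokens)
    sink = torch.full((q.shape[0], 1), sink_bias)  # extra column of bias logits
    weights = F.softmax(torch.cat([logits, sink], dim=1), dim=1)
    return weights[:, :-1] @ v                     # drop the sink column (zero value)

out_plain = attention_with_sink(q, k, v, sink_bias=float("-inf"))  # ordinary softmax attention
out_sink = attention_with_sink(q, k, v, sink_bias=2.0)             # some mass diverted to the sink
print(out_plain.shape, out_sink.shape)
```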
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorize to lower loss? And why is generalization transient? Our work explains this & *predicts Transformer behavior throughout training* without its weights! 🧵 1/
Cool work uses "visual anagrams": two images of different objects made out of the same image patches. A model must classify both correctly to score. Hence, higher-scoring models use global geometry; lower-scoring ones use textures. SigLIP is the GOAT of course, or I wouldn't repost this (jk)
We find self-supervised & language-aligned ViTs scored highest on the configural shape score (CSS) — even matching humans. Supervised models are not even close. Surprisingly, high ImageNet accuracy does not guarantee a high configural shape score! 5/15
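A toy sketch of the pair-level scoring rule described above (both anagram images must be classified correctly for credit); the paper's CSS definition is the authoritative one.

```python
import torch

def configural_pair_score(logits_a, logits_b, label_a, label_b):
    """Pair-level scoring sketch: credit only if BOTH anagram images are right.

    Illustrative only -- this just encodes the 'must classify both correctly'
    rule from the thread, not the paper's full CSS definition.
    """
    correct_a = logits_a.argmax(dim=-1) == label_a
    correct_b = logits_b.argmax(dim=-1) == label_b
    return (correct_a & correct_b).float().mean().item()

# Toy example: 4 anagram pairs, 10 classes.
torch.manual_seed(0)
logits_a, logits_b = torch.randn(4, 10), torch.randn(4, 10)
label_a, label_b = torch.randint(0, 10, (4,)), torch.randint(0, 10, (4,))
print(configural_pair_score(logits_a, logits_b, label_a, label_b))
```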
Beautiful work (and thread)! They revisit the well-known shape-vs-texture bias, this time with objects made of the same subparts. With ablations they confirm the intuition that long-range attention mechanisms are essential for transformers to "see" the global shape in the picture.