Efstathios Karypidis
@K_Sta8is
PhD Candidate, Archimedes Unit | National Technical University of Athens
1/n 🚀 Excited to share our latest work: DINO-Foresight, a new framework for predicting the future states of scenes using Vision Foundation Model features! Links to the arXiv and GitHub 👇
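For the curious, a minimal sketch of the idea as described above: extract patch features for past frames with a frozen backbone (DINOv2 here) and train a transformer to forecast the next frame's features, so frozen task heads can run on predicted features instead of predicted pixels. All names and shapes below are illustrative, not the actual DINO-Foresight code.

```python
# Sketch of feature forecasting with a frozen Vision Foundation Model.
# Names are illustrative; positional/temporal embeddings omitted for brevity.
import torch
import torch.nn as nn

class FeatureForecaster(nn.Module):
    """Predicts next-frame VFM patch features from a window of past frames."""
    def __init__(self, dim=768, depth=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)

    def forward(self, feats):                # feats: (B, T, N, D) patch features
        B, T, N, D = feats.shape
        x = feats.reshape(B, T * N, D)       # flatten time into the token axis
        x = self.temporal(x)
        return x[:, -N:, :]                  # last N tokens = predicted next frame

# Extract features for a toy clip with a frozen DINOv2, then forecast.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
frames = torch.randn(1, 4, 3, 224, 224)      # (B, T, C, H, W) toy clip
with torch.no_grad():
    feats = torch.stack([backbone.forward_features(frames[:, t])["x_norm_patchtokens"]
                         for t in range(frames.shape[1])], dim=1)
pred_next = FeatureForecaster()(feats)       # (1, 256, 768) predicted features
# Frozen heads (segmentation, depth) would then consume pred_next directly.
```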

Just back from CVPR@Paris 🇫🇷. What a fantastic event! Great talks, great posters, and great to connect with the French & European vision community. Kudos to the organizers; hoping it returns next year! 🤞 #CVPR2025 @CVPR
📢 R u in Athens on July 22? 📢 Check out the #ComputerVision Day @ ArchimedesAI! Talks: 👉@VickyKalogeiton: 'Efficient Brains that Imagine' 👉Dimitris Samaras: 'From Saliency to Scanpaths: 20 years of Wandering Eyes' 👉@dimtzionas: 'Towards In-the-Wild Understanding of 3D…
Interesting alternative to multi-token prediction, though the figure is a bit unintuitive. Instead of attaching a head for each +d-th prediction, it passes a dummy input token for each extra prediction through the model. This is A LOT more expensive, e.g. doing 2-step prediction…
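Rough arithmetic behind the cost claim, assuming each extra predicted step inserts one dummy token per position: the sequence grows (k+1)x, so quadratic self-attention compute grows roughly (k+1)² x.

```python
# Rough attention-cost comparison (illustrative arithmetic, not a benchmark):
# interleaving k dummy tokens per position grows the sequence (k+1)x,
# so quadratic self-attention cost grows ~(k+1)^2 x.
def attn_cost(seq_len):            # proportional to seq_len^2
    return seq_len ** 2

n, k = 1024, 1                     # 2-step prediction -> 1 dummy token per position
print(attn_cost(n * (k + 1)) / attn_cost(n))   # 4.0: ~4x attention compute
```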
1/n Multi-token prediction boosts LLMs (DeepSeek-V3), tackling key limitations of the next-token setup: • Short-term focus • Struggles with long-range decisions • Weaker supervision Prior methods add complexity (extra layers) 🔑 Our fix? Register tokens—elegant and powerful
Nice trick for fine-tuning with multi-token prediction without architecture changes: interleave learnable register tokens into the input sequence & discard them at inference. It works for supervised fine-tuning, PEFT, pretraining, on both language and vision domains 👇
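A minimal PyTorch sketch of the trick as described: interleave a learnable register after every token during training, read the +2-step prediction off the register slots, and skip the registers at inference. The interleaving pattern and prediction horizon here are assumptions, not the paper's exact recipe.

```python
# Sketch of multi-token prediction via learnable register tokens (assumptions:
# one register per position, trained to predict the token 2 steps ahead).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.register = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, 2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, use_registers=True):
        x = self.embed(tokens)                       # (B, T, D)
        if use_registers:                            # training: t0 r0 t1 r1 ...
            B, T, D = x.shape
            reg = self.register.expand(B, T, D)
            x = torch.stack([x, reg], 2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        return self.head(self.blocks(x, mask=mask, is_causal=True))

def mtp_loss(model, tokens):
    logits = model(tokens)                           # (B, 2T, V)
    next_logits = logits[:, 0::2]                    # token slots -> predict t+1
    plus2_logits = logits[:, 1::2]                   # register slots -> predict t+2
    loss = F.cross_entropy(next_logits[:, :-1].transpose(1, 2), tokens[:, 1:])
    loss += F.cross_entropy(plus2_logits[:, :-2].transpose(1, 2), tokens[:, 2:])
    return loss

# At inference, call model(tokens, use_registers=False): the architecture and
# decoding cost are identical to a plain next-token model.
```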
New paper out - accepted at @ICCVConference We introduce MoSiC, a self-supervised learning framework that learns temporally consistent representations from video using motion cues. Key idea: leverage long-range point tracks to enforce dense feature coherence across time.🧵
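A hedged sketch of the core objective as stated: sample dense features along precomputed point tracks and pull them together across frames. The loss form and names are my guesses, not the released code.

```python
# Sketch: enforce feature coherence along long-range point tracks
# (names and loss form are illustrative; see the MoSiC repo for the actual one).
import torch
import torch.nn.functional as F

def track_consistency_loss(feat_maps, tracks, visible):
    """feat_maps: (T, D, H, W) dense features per frame
       tracks:    (T, P, 2) track coordinates in [-1, 1] (grid_sample convention)
       visible:   (T, P) bool visibility per point"""
    # Sample one feature vector per track point in every frame.
    sampled = F.grid_sample(feat_maps, tracks.unsqueeze(1),   # (T, D, 1, P)
                            align_corners=False).squeeze(2)    # (T, D, P)
    sampled = F.normalize(sampled, dim=1)
    anchor = sampled[0]                                        # first frame
    # Cosine loss: each later frame's point feature should match its anchor.
    sims = (sampled[1:] * anchor.unsqueeze(0)).sum(dim=1)      # (T-1, P)
    mask = (visible[1:] & visible[:1]).float()
    return ((1 - sims) * mask).sum() / mask.sum().clamp(min=1)
```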
1/n 🚀New paper out - accepted at @ICCVConference! Introducing DIP: unsupervised post-training that enhances dense features in pretrained ViTs for dense in-context scene understanding Below: Low-shot in-context semantic segmentation examples. DIP features outperform DINOv2!
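The in-context protocol itself is easy to sketch: label each query patch by its nearest support patch in feature space. This is a simplified version of the evaluation setup, not DIP's training.

```python
# Sketch of dense in-context segmentation by nearest-neighbor patch matching
# (simplified evaluation protocol, not DIP's post-training code).
import torch
import torch.nn.functional as F

def in_context_segment(support_feats, support_labels, query_feats):
    """support_feats:  (N, D) patch features of the support image
       support_labels: (N,)   per-patch class ids (mask downsampled to patches)
       query_feats:    (M, D) patch features of the query image"""
    s = F.normalize(support_feats, dim=1)
    q = F.normalize(query_feats, dim=1)
    nn_idx = (q @ s.t()).argmax(dim=1)   # nearest support patch per query patch
    return support_labels[nn_idx]        # (M,) predicted per-patch labels
```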
Achievement unlocked: having Alyosha at our FUNGI poster, the one person I had in mind while working on this paper on cheap and better representations for k-NN classification and beyond. #cvprinparis #cvpr2025
Self-supervised learning is fantastic for pretraining, but can we use it for other tasks (kNN classification, in-context learning) & modalities, w/o training & by simply using its gradients as features? Enter 🍄FUNGI - Features from UNsupervised GradIents #NeurIPS2024 🧵
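A rough reading of the recipe in PyTorch: take the per-sample gradient of a self-supervised objective at a small head, flatten it, and concatenate it with the embedding as the kNN feature. The specific loss and layer below are placeholders, not the paper's full recipe.

```python
# Sketch of gradients-as-features (FUNGI-style). The SSL objective and the
# layer whose gradient is taken are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_features(backbone, head, x):
    """Feature = normalized embedding ++ normalized flattened gradient of a
       self-supervised loss w.r.t. a small projection head."""
    with torch.no_grad():
        emb = backbone(x)                      # (1, D), frozen backbone
    head.zero_grad()
    z = head(emb)
    # Placeholder objective: KL between the projection and a uniform target
    # (FUNGI combines gradients of several SSL losses; this is one stand-in).
    loss = F.kl_div(F.log_softmax(z, dim=1),
                    torch.full_like(z, 1.0 / z.shape[1]), reduction="batchmean")
    loss.backward()
    grad = head.weight.grad.flatten().unsqueeze(0)
    return torch.cat([F.normalize(emb, dim=1), F.normalize(grad, dim=1)], dim=1)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384)).eval()
head = nn.Linear(384, 64)
feats = gradient_features(backbone, head, torch.randn(1, 3, 32, 32))
# Stack these over the dataset and run plain kNN on them; no training needed.
```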
🚀UniWorld: a unified model that skips VAEs and uses semantic features from SigLIP! Using just 1% of BAGEL’s data, it outperforms BAGEL on image editing and excels in understanding & generation. 🌟The data, model, and training & evaluation scripts are now open-source! github.com/PKU-YuanGroup/…
Better LLM training? @GregorBachmann1 & @_vaishnavh showed next-token prediction causes shortcut learning. A fix? Multi-token prediction training (thanks @FabianGloeckle). We use register tokens: minimal architecture changes & scalable prediction horizons. x.com/NasosGer/statu…
EQ-VAE is accepted at #ICML2025 😁. Grateful to my co-authors for their guidance and collaboration! @IoannisKakogeo1, @SpyrosGidaris, Nikos Komodakis.
1/n🚀If you’re working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.👇
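The regularizer is simple enough to sketch: decoding a spatially transformed latent should match the transformed image. A toy version with 90-degree rotations only; the paper uses a richer family of spatial transforms (e.g. scaling).

```python
# Sketch of equivariance regularization on a VAE latent (rotations only here).
import torch
import torch.nn.functional as F

def eq_vae_reg(encoder, decoder, x):
    z = encoder(x)                       # (B, C, h, w) spatial latent
    k = int(torch.randint(1, 4, (1,)))   # random 90-degree rotation
    x_rot = torch.rot90(x, k, dims=(2, 3))
    x_from_z_rot = decoder(torch.rot90(z, k, dims=(2, 3)))
    return F.mse_loss(x_from_z_rot, x_rot)

enc = torch.nn.Conv2d(3, 8, 4, stride=4)          # toy encoder/decoder pair
dec = torch.nn.ConvTranspose2d(8, 3, 4, stride=4)
loss = eq_vae_reg(enc, dec, torch.randn(2, 3, 64, 64))
# Added to the usual reconstruction/KL objective, this pushes the latent to
# transform the same way the image does, i.e. to be equivariant.
```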
🌌🛰️Wanna know which features are universal vs unique in your models and how to find them? Excited to share our preprint: "Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment"! arxiv.org/abs/2502.03714 (1/9)
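A sketch of the basic construction as I read the abstract: per-model encoders map activations into one shared sparse concept space, and per-model decoders map back, with cross-reconstruction doing the aligning. The TopK sparsity and all names are my assumptions, not the paper's exact architecture.

```python
# Sketch of a universal sparse autoencoder: per-model encoders/decoders around
# one shared sparse concept space.
import torch
import torch.nn as nn

class UniversalSAE(nn.Module):
    def __init__(self, dims, n_concepts=4096, k=32):
        super().__init__()
        self.enc = nn.ModuleList([nn.Linear(d, n_concepts) for d in dims])
        self.dec = nn.ModuleList([nn.Linear(n_concepts, d) for d in dims])
        self.k = k

    def encode(self, a, i):              # model i's activations -> shared code
        c = self.enc[i](a).relu()
        top = torch.topk(c, self.k, dim=1)
        return torch.zeros_like(c).scatter_(1, top.indices, top.values)

    def forward(self, acts):             # acts: list of (B, d_i), one per model
        codes = [self.encode(a, i) for i, a in enumerate(acts)]
        # Cross-reconstruction: every model's code should reconstruct every
        # model's activations, which is what aligns concepts across models.
        return [[dec(c) for dec in self.dec] for c in codes]

usae = UniversalSAE(dims=[768, 1024])
recons = usae([torch.randn(8, 768), torch.randn(8, 1024)])
```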
1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture – Low-level image details (via VAE latents) – High-level semantic features (via DINOv2)🧵
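A hedged sketch of the joint forward process, assuming the two representations are simply channel-concatenated on a shared spatial grid (my assumption, not necessarily how ReDi combines them):

```python
# Sketch of joint diffusion over two representations (channel-concat is an
# assumption for how the VAE latent and DINOv2 features are combined).
import torch

def joint_forward_noising(vae_latent, dino_feats, t, noise_sched):
    """vae_latent: (B, 4, h, w); dino_feats: (B, C, h, w) DINOv2 features
       projected/reshaped to the same spatial grid."""
    x0 = torch.cat([vae_latent, dino_feats], dim=1)   # one joint 'image'
    eps = torch.randn_like(x0)
    a = noise_sched[t].view(-1, 1, 1, 1)              # cumulative alpha at t
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps         # standard DDPM noising
    return xt, eps      # one denoiser is trained to predict eps for both parts

sched = torch.linspace(0.9999, 0.01, 1000)            # toy cumulative-alpha schedule
xt, eps = joint_forward_noising(torch.randn(2, 4, 32, 32),
                                torch.randn(2, 16, 32, 32),
                                torch.randint(0, 1000, (2,)), sched)
```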
The sdxl-VAE models a substantial amount of noise. Things we can't even see. It meticulously encodes the noise, uses precious bottleneck capacity to store it, then faithfully reconstructs it in the decoder. I grabbed what I thought was a simple black vector circle on a white…
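Easy to reproduce with diffusers, assuming the stabilityai/sdxl-vae checkpoint: encode a clean synthetic circle, decode, and inspect the residual.

```python
# Quick reproduction of the observation with diffusers (the circle here is
# synthetic, not the original image from the tweet).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

# Clean black circle on white, values in [-1, 1] as the VAE expects.
yy, xx = torch.meshgrid(torch.arange(256), torch.arange(256), indexing="ij")
circle = ((yy - 128) ** 2 + (xx - 128) ** 2 < 64 ** 2).float()
img = (1 - 2 * circle).expand(1, 3, 256, 256)    # white bg, black circle

with torch.no_grad():
    z = vae.encode(img).latent_dist.mode()
    recon = vae.decode(z).sample
print((recon - img).abs().mean())   # nonzero residual: 'noise' the VAE adds back
```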
Made with Sora Input: KITTI image Prompt 1: “Make this into a semantic segmentation map” Prompt 2: “Make this into a depth map”