Amir Zamir
@zamir_ar
Assistant Prof of CS, @EPFL_en Swiss Federal Institute of Technology. Previously @Berkeley_AI, @StanfordAILab, @ucf. Into #ComputerVision, #MachineLearning, #AI
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the…
We are releasing the 1st version of 4M, a framework for training multimodal foundation models across tens of modalities & tasks, based on scalable masked modeling. Joint effort by @EPFL_en & @Apple. 4M: Massively Multimodal Masked Modeling 🌐4m.epfl.ch 🧵1/n
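The core idea behind the 4M objective can be illustrated with a toy sketch (this is not the 4M codebase, and the modality names and mask ratio are hypothetical): tokens from several modalities are concatenated into one sequence, a random subset is masked out, and the model is trained to predict the masked tokens from the visible ones.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = -1  # sentinel id for masked-out positions (hypothetical)

# Two hypothetical tokenized modalities, e.g. RGB patches and a depth map.
rgb_tokens = rng.integers(0, 1024, size=8)
depth_tokens = rng.integers(0, 1024, size=8)
sequence = np.concatenate([rgb_tokens, depth_tokens])

# Masked modeling: hide a random subset of tokens across all modalities.
mask = rng.random(sequence.size) < 0.5

# Visible tokens form the model input; hidden tokens are the prediction targets.
inputs = np.where(mask, MASK_ID, sequence)
targets = sequence[mask]
```

Because every modality is reduced to discrete tokens first, the same masked-prediction recipe scales across modalities without per-modality losses.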
The best intellectual collaborations tend to be between two types of people: the intuitionist and the formalizer. Sometimes an intuitionist's idea is so radical that no formalizer will engage. Reciprocally, a formal framework may be so complex that it lends itself to no intuition 1/4
What if AI isn’t about building solo geniuses, but designing social systems? Michael Jordan advocates blending ML, economics, and uncertainty management to prioritize social welfare over mere prediction. A must-read rethink. arxiv.org/abs/2507.06268…
Sadly, I am no longer a professor at ETH (@eth_en) due to very severe #longCovid and #MECFS. ethrat.ch/de/ernennungen….
We open-sourced the codebase of FlexTok. FlexTok is an image tokenizer that produces flexible-length token sequences and represents image content in a compressed coarse-to-fine way. Like in PCA: the 1st token captures the most compressed representation of the image, the 2nd…
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. flextok.epfl.ch arxiv.org/abs/2502.13967 🧵 1/n
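The PCA analogy from the tweet above can be made concrete with a minimal numpy sketch (this illustrates only the analogy, not FlexTok itself): components are ordered by explained variance, so keeping just the first k gives a coarse reconstruction that sharpens as k grows, much like FlexTok's first tokens carrying the most compressed description of the image.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # 100 toy "images" as 16-dim vectors
X -= X.mean(axis=0)             # center the data before PCA

# PCA via SVD: rows of Vt are principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruct(X, Vt, k):
    """Project onto the first k principal components and map back."""
    return X @ Vt[:k].T @ Vt[:k]

# Coarse-to-fine: reconstruction error shrinks as more components are kept.
errors = [np.linalg.norm(X - reconstruct(X, Vt, k)) for k in (1, 4, 8, 16)]
assert all(a >= b for a, b in zip(errors, errors[1:]))
```

Truncating a FlexTok token sequence plays the same role as truncating the component list here: any prefix is a valid, progressively coarser description.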
Many congratulations to @jianyuan_wang, @MinghaoChen23, @n_karaev, Andrea Vedaldi, Christian Rupprecht and @davnov134 for winning the Best Paper Award @CVPR for "VGGT: Visual Geometry Grounded Transformer" 🥇🎉 🙌🙌 #CVPR2025!!!!!!
There will be a talk on meditation, concentration, and ancient Eastern wisdom by an expert monk next week. Followed by a week-long workshop with meditation training. Open to everyone. Tuesday, May 27th, 5:30 PM. memento.epfl.ch/event/wisdom-i… @EPFL_en @EPFL @ICepfl @epflSV

Happy to share that I’ve successfully defended my PhD thesis, “Scaling the Modalities in Multimodal Foundation Models”! 🎓 🎉 A huge thanks to my incredible advisor @zamir_ar and all the amazing collaborators I’ve had the chance to work with across EPFL, Apple, and Google.
We are pleased to invite you to NUS (@NUSingapore)-Swiss AI Workshop on Wed April 23 (a day before @iclr_conf #ICLR2025 in Singapore). Register for the workshop here (seats are limited!!!): forms.office.com/pages/response… This workshop is co-organized with the Swiss AI Initiative, a…
FlexTok is a pretty novel dynamic-length image tokenizer. I will be speedrunning training one today (8:30 AM EST) at twitch.tv/cloneofsimo, which is roughly in 3 hours
In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.