Roman Bachmann
@roman__bachmann
CS PhD student at @EPFL_en, previously at @Apple, @RIKEN_AIP. | Working on scalable multimodal foundation models.
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. flextok.epfl.ch arxiv.org/abs/2502.13967 🧵 1/n

Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed; we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
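A toy sketch of the general idea (not the paper's actual law or parameterization; the loss form, data, and variable names below are purely illustrative): fit a simple loss model to a handful of small-scale runs over mixture weights, then pick the mixture that minimizes the predicted loss at a much larger scale.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

# Toy loss model (illustrative only): loss depends on total tokens N
# and a single mixture weight w (e.g. the web-data fraction).
def loss_model(X, a, b, alpha, c):
    N, w = X
    return a * N ** (-alpha) + b * (w - c) ** 2  # power law in scale + mixture penalty

# Hypothetical small-scale runs: (tokens, mixture weight) -> measured loss
N_small = np.array([1e9, 1e9, 1e9, 3e9, 3e9, 3e9])
w_small = np.array([0.2, 0.5, 0.8, 0.2, 0.5, 0.8])
loss_small = np.array([3.10, 2.95, 3.05, 2.90, 2.74, 2.86])

params, _ = curve_fit(loss_model, (N_small, w_small), loss_small, p0=[15, 1, 0.1, 0.5])

# Extrapolate to a large-scale run and pick the mixture minimizing predicted loss.
N_large = 1e12
best = minimize(lambda w: loss_model((N_large, w[0]), *params),
                x0=[0.5], bounds=[(0.0, 1.0)])
print(f"Predicted optimal mixture weight at {N_large:.0e} tokens: {best.x[0]:.2f}")
```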
We will present FlexTok at #ICML2025 on Tuesday! Drop by to chat with @JRAllardice and me if you're interested in tokenization, flexible ways to encode images, and generative modeling. 📆 Tue, Jul 15, 16:30 PDT 📍 East Exhibition Hall, Poster E-3010 🌐 flextok.epfl.ch
How well do multimodal foundation models understand images compared to vision specialists? 🤔 We benchmarked their geometric and semantic understanding capabilities on standard vision tasks and datasets. Check out our new paper!
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
We open-sourced the codebase of FlexTok. FlexTok is an image tokenizer that produces flexible-length token sequences and represents image content in a compressed coarse-to-fine way. Like in PCA: the 1st token captures the most compressed representation of the image, the 2nd…
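The PCA analogy is easy to see concretely. A runnable toy sketch (plain scikit-learn PCA, nothing FlexTok-specific): reconstruct an image from only its first k principal components and watch the error drop as k grows, the same coarse-to-fine behaviour FlexTok's ordered 1D token sequence is designed to have.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Illustrates the PCA analogy from the post, not FlexTok itself:
# keeping only the first k components gives a coarse-to-fine reconstruction.
X = load_digits().data              # (1797, 64) flattened 8x8 digit images
pca = PCA(n_components=64).fit(X)

img = X[:1]                         # one image, shape (1, 64)
for k in (1, 2, 4, 8, 16, 64):
    codes = pca.transform(img)[:, :k]                 # "first k tokens"
    recon = codes @ pca.components_[:k] + pca.mean_   # decode the prefix
    err = np.mean((recon - img) ** 2)
    print(f"{k:2d} components -> reconstruction MSE {err:.2f}")
```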
New blog post: let's talk about latents! sander.ai/2025/04/15/lat…
Happy to share that we released FlexTok code and models on github.com/apple/ml-flext…. Try them with our interactive @huggingface demo on huggingface.co/spaces/EPFL-VI…
Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: lnkd.in/g4iNJFmU. Project Page: flextok.epfl.ch #FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI
FlexTok is a pretty novel dynamic-length image tokenizer. I will be speedrunning training one today (8:30 AM EST) at twitch.tv/cloneofsimo, which is roughly in 3 hours
Honored to see our research featured on the @EPFL_en front page! Check out the article to learn more about our latest efforts in multimodality and where we go from here.
Researchers from our school have developed 4M, a next-generation, open-sourced framework for training versatile and scalable multimodal foundation models that go beyond language.💡🚀 go.epfl.ch/lrD-en
Happening today! 👀 If you'd like to discuss any-to-any multimodal models, tokenization, and scaling, come join @oguzhanthefatih, @zamir_ar, and me at poster 3709 in East Exhibit Hall A-C at 11am-2pm PST.
We are going to present 4M-21 next week at #NeurIPS2024 in Vancouver 🇨🇦. Come chat with us (@roman__bachmann, @zamir_ar and myself) if you are interested in multimodal foundation models! 📅 Thu 12 Dec 11 a.m. PST Poster 3709. 🌐 4m.epfl.ch with a live demo 📸
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the…
Apple releases AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders