Roman Bachmann
@roman__bachmann
CS PhD student at @EPFL_en, previously at @Apple, @RIKEN_AIP. | Working on scalable multimodal foundation models.
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. flextok.epfl.ch arxiv.org/abs/2502.13967 🧵 1/n

Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed; we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
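A toy sketch of the general idea (not the paper's actual law or parameterization; the loss form, data, and variable names below are purely illustrative): fit a simple loss model to a handful of small-scale runs over mixture weights, then pick the mixture that minimizes the predicted loss at a much larger scale.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

# Toy loss model (illustrative only): loss depends on total tokens N
# and a single mixture weight w (e.g. the web-data fraction).
def loss_model(X, a, b, alpha, c):
    N, w = X
    return a * N ** (-alpha) + b * (w - c) ** 2  # power law in scale + mixture penalty

# Hypothetical small-scale runs: (tokens, mixture weight) -> measured loss
N_small = np.array([1e9, 1e9, 1e9, 3e9, 3e9, 3e9])
w_small = np.array([0.2, 0.5, 0.8, 0.2, 0.5, 0.8])
loss_small = np.array([3.10, 2.95, 3.05, 2.90, 2.74, 2.86])

params, _ = curve_fit(loss_model, (N_small, w_small), loss_small, p0=[15, 1, 0.1, 0.5])

# Extrapolate to a large-scale run and pick the mixture minimizing predicted loss.
N_large = 1e12
best = minimize(lambda w: loss_model((N_large, w[0]), *params),
                x0=[0.5], bounds=[(0.0, 1.0)])
print(f"Predicted optimal mixture weight at {N_large:.0e} tokens: {best.x[0]:.2f}")
```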
We will present FlexTok at #ICML2025 on Tuesday! Drop by to chat with @JRAllardice and me if you're interested in tokenization, flexible ways to encode images, and generative modeling. 📆 Tue, Jul 15, 16:30 PDT 📍 East Exhibition Hall, Poster E-3010 🌐 flextok.epfl.ch
How well do multimodal foundation models understand images compared to vision specialists? 🤔 We benchmarked their geometric and semantic understanding capabilities on standard vision tasks and datasets. Check out our new paper!
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
We open-sourced the codebase of FlexTok. FlexTok is an image tokenizer that produces flexible-length token sequences and represents image content in a compressed coarse-to-fine way. Like in PCA: the 1st token captures the most compressed representation of the image, the 2nd…
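The PCA analogy is easy to see concretely. A runnable toy sketch (plain scikit-learn PCA, nothing FlexTok-specific): reconstruct an image from only its first k principal components and watch the error drop as k grows, the same coarse-to-fine behaviour FlexTok's ordered 1D token sequence is designed to have.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Illustrates the PCA analogy from the post, not FlexTok itself:
# keeping only the first k components gives a coarse-to-fine reconstruction.
X = load_digits().data              # (1797, 64) flattened 8x8 digit images
pca = PCA(n_components=64).fit(X)

img = X[:1]                         # one image, shape (1, 64)
for k in (1, 2, 4, 8, 16, 64):
    codes = pca.transform(img)[:, :k]                 # "first k tokens"
    recon = codes @ pca.components_[:k] + pca.mean_   # decode the prefix
    err = np.mean((recon - img) ** 2)
    print(f"{k:2d} components -> reconstruction MSE {err:.2f}")
```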
New blog post: let's talk about latents! sander.ai/2025/04/15/lat…
Happy to share that we released FlexTok code and models on github.com/apple/ml-flext…. Try them with our interactive @huggingface demo on huggingface.co/spaces/EPFL-VI…
Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: lnkd.in/g4iNJFmU. Project Page: flextok.epfl.ch #FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI
FlexTok is a pretty novel dynamic-length image tokenizer. I will be speedrunning training one today (8:30 AM EST) at twitch.tv/cloneofsimo, which is roughly in 3 hours
Honored to see our research featured on the @EPFL_en front page! Check out the article to learn more about our latest efforts in multimodality and where we go from here.
Researchers from our school have developed 4M, a next-generation, open-sourced framework for training versatile and scalable multimodal foundation models that go beyond language.💡🚀 go.epfl.ch/lrD-en
Happening today! 👀 If you'd like to discuss any-to-any multimodal models, tokenization, and scaling, come join @oguzhanthefatih, @zamir_ar, and me at poster 3709 in East Exhibit Hall A-C at 11am-2pm PST.
We are going to present 4M-21 next week at #NeurIPS2024 in Vancouver 🇨🇦. Come chat with us (@roman__bachmann, @zamir_ar and myself) if you are interested in multimodal foundation models! 📅 Thu 12 Dec 11 a.m. PST Poster 3709. 🌐 4m.epfl.ch with a live demo 📸
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the…
Apple releases AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders