Oğuzhan Fatih Kar
@oguzhanthefatih
Machine Learning Researcher at @Apple. CS PhD @EPFL_en on multimodal foundation models. Previously @Google, @METU_ODTU, @aselsan.
Happy to share that I’ve successfully defended my PhD thesis, “Scaling the Modalities in Multimodal Foundation Models”! 🎓 🎉 A huge thanks to my incredible advisor @zamir_ar and all the amazing collaborators I’ve had the chance to work with across EPFL, Apple, and Google.
Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
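As a rough illustration of the idea (not the paper's actual method; the documents, benchmark examples, and TF-IDF scoring below are made up purely for demonstration), one could rank candidate pretraining documents by how closely they resemble examples from the target benchmarks:

```python
# Toy illustration (not the paper's method): rank candidate pretraining
# documents by their similarity to examples from a target benchmark.
# All data below is made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target_examples = [
    "What is the capital of France? Paris.",
    "Solve: 12 * 7 = 84.",
]
candidate_docs = [
    "Paris is the capital and largest city of France.",
    "The recipe calls for two cups of flour and one egg.",
    "Multiplication tables: 12 x 7 equals 84, 12 x 8 equals 96.",
]

vec = TfidfVectorizer().fit(target_examples + candidate_docs)
sims = cosine_similarity(vec.transform(candidate_docs),
                         vec.transform(target_examples)).max(axis=1)

# Keep the documents that look most like the target tasks.
for score, doc in sorted(zip(sims, candidate_docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```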
We will present FlexTok at #ICML2025 on Tuesday! Drop by to chat with @JRAllardice and me if you're interested in tokenization, flexible ways to encode images, and generative modeling. 📆 Tue, Jul 15, 16:30 PDT 📍 East Exhibition Hall, Poster E-3010 🌐 flextok.epfl.ch
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. flextok.epfl.ch arxiv.org/abs/2502.13967 🧵 1/n
Final Takeaways 📌 Multimodal foundation models are impressive generalists, but they still lag behind vision specialists. 📌 They perform better on semantic tasks (e.g., classification, segmentation) than on geometric ones (depth, surface normals). 📌 Among the non-reasoning models, GPT-4o…
🚀 Excited to share our latest work on evaluating the visual capabilities of leading multimodal foundation models! 🔗 Code & prompt chains are open-sourced: fm-vision-evals.epfl.ch Big shoutout to our intern @rahul_ramach for leading the work, and to the whole team!
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
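A minimal sketch of the prompt-chaining idea, assuming the OpenAI Python client; this is not the open-sourced prompt chains from fm-vision-evals.epfl.ch, just an illustration of splitting a classification query into steps a chat model can follow:

```python
# Illustrative two-step prompt chain for image classification with a
# multimodal chat model. The chain design here is a simplified stand-in,
# not the released evaluation code.
from openai import OpenAI

client = OpenAI()

def classify(image_url: str, labels: list[str]) -> str:
    # Step 1: ask the model to describe the image contents.
    desc = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "List the main objects in this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    ).choices[0].message.content

    # Step 2: map the description to one of the candidate labels.
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Objects: {desc}\nPick the single best label from: "
            f"{', '.join(labels)}. Answer with the label only."}],
    ).choices[0].message.content
    return answer.strip()

print(classify("https://example.com/cat.jpg", ["tabby cat", "golden retriever"]))
```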
We open-sourced the codebase of FlexTok. FlexTok is an image tokenizer that produces flexible-length token sequences and represents image content in a compressed, coarse-to-fine way. Like in PCA: the 1st token captures the most compressed representation of the image, the 2nd…
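The PCA analogy can be made concrete with a small, self-contained example (an analogy only, not the FlexTok code): reconstructing an image from a growing prefix of its SVD components shows the same coarse-to-fine behavior, where the first component already captures the broad structure and later ones add detail.

```python
# Coarse-to-fine reconstruction from a prefix of SVD components,
# as an analogy for decoding from a prefix of 1D tokens.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in grayscale image

U, S, Vt = np.linalg.svd(image, full_matrices=False)

for k in (1, 4, 16, 64):  # like decoding from 1, 4, 16, 64 tokens
    approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    err = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"components kept: {k:3d}  relative error: {err:.3f}")
```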
Happy to share that I’ve joined @Apple as a Machine Learning Researcher! I’m looking forward to working on exciting projects in multimodal foundation models with an amazing team.
Happy to share that we released FlexTok code and models on github.com/apple/ml-flext…. Try them with our interactive @huggingface demo on huggingface.co/spaces/EPFL-VI…
Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: lnkd.in/g4iNJFmU. Project Page: flextok.epfl.ch #FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI
Check out our recent work on flexible image tokenization📷🗜️, led by amazing folks @roman__bachmann, @JRAllardice, @dmizrahi_🎉👏 Stay tuned for code, weights, live demo and more! flextok.epfl.ch
Our recent work on multimodal AI is featured on the @EPFL_en front page! 🎉🚀
Researchers from our school have developed 4M, a next-generation, open-sourced framework for training versatile and scalable multimodal foundation models that go beyond language.💡🚀 go.epfl.ch/lrD-en
Happening today! 👀 If you'd like to discuss any-to-any multimodal models, tokenization, and scaling, come join @oguzhanthefatih, @zamir_ar, and me at poster 3709 in East Exhibit Hall A-C at 11am-2pm PST.
We are going to present 4M-21 next week at #NeurIPS2024 in Vancouver 🇨🇦. Come chat with us (@roman__bachmann, @zamir_ar and myself) if you are interested in multimodal foundation models! 📅 Thu 12 Dec 11 a.m. PST Poster 3709. 🌐 4m.epfl.ch with a live demo 📸
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks and modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the…
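For readers unfamiliar with the setup, here is a toy sketch of the underlying "everything becomes discrete tokens" idea (schematic only, not the released ml-4m code): each modality gets its own tokenizer and its own slice of a shared vocabulary, so a single sequence model can consume and predict any mix of modalities.

```python
# Schematic illustration of modality-specific tokenizers sharing one vocabulary.
import numpy as np

VOCAB_PER_MODALITY = 1024
MODALITIES = ["rgb", "depth", "caption"]

def tokenize(modality: str, values: np.ndarray) -> np.ndarray:
    """Quantize values to integers and shift them into the modality's vocab range."""
    offset = MODALITIES.index(modality) * VOCAB_PER_MODALITY
    ids = np.clip((values * (VOCAB_PER_MODALITY - 1)).astype(int),
                  0, VOCAB_PER_MODALITY - 1)
    return ids + offset

rng = np.random.default_rng(0)
rgb_tokens = tokenize("rgb", rng.random(16))      # e.g., 16 image tokens
depth_tokens = tokenize("depth", rng.random(16))  # 16 depth tokens

# An any-to-any model is trained to predict, say, depth_tokens given
# rgb_tokens (and vice versa), all within one shared token space.
sequence = np.concatenate([rgb_tokens, depth_tokens])
print(sequence[:8], "...", sequence[-8:])
```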
I am happy to be recognized as a top reviewer for #NeurIPS2024! 🎉 neurips.cc/Conferences/20…
Thanks @mervenoyann for sharing our work 🤗 see you all Thursday afternoon at #ECCV2024 to discuss VLMs!
One of my favorite vision language models is now BRAVE 🦁 (@oguzhanthefatih et al) Very simply put, BRAVE investigates using multiple pre-trained vision encoders. But what makes it different?
We will present BRAVE🦁 next week at #ECCV2024 in Milan! 🇮🇹 Come chat with us: 📅 Oral Presentation 6C: Vision and Other Modalities: 🗓 Thu, 3 Oct, 1:30 p.m. – 3:30 p.m. CEST 📅 Poster Session 6, 190: 🗓 Thu, 3 Oct, 4:30 p.m. – 6:30 p.m. CEST 🌐brave-vlms.epfl.ch
We introduce BRAVE🦁 to broaden the visual capabilities of VLMs by leveraging diverse visual biases, enabling strong performance on several captioning & VQA tasks. Joint work w/ Alessio, Petra, Ace, @zamir_ar, @fedassa as part of my @Google internship. brave-vlms.epfl.ch
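A generic sketch of the multi-encoder idea (the stand-in encoders and the simple concatenate-then-project fusion below are assumptions for illustration; BRAVE's actual fusion module differs, see the paper): features from several pre-trained vision encoders with different visual biases are combined and projected into the language model's embedding space.

```python
# Generic multi-encoder adapter: combine features from several vision
# encoders and project them into an LM's embedding space. Stand-ins only.
import torch
import torch.nn as nn

class MultiEncoderAdapter(nn.Module):
    def __init__(self, encoders: list[nn.Module], feat_dims: list[int], lm_dim: int):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        # Project the concatenated features into the LM's embedding space.
        self.proj = nn.Linear(sum(feat_dims), lm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = [enc(image) for enc in self.encoders]  # each: (B, D_i)
        return self.proj(torch.cat(feats, dim=-1))     # (B, lm_dim)

# Stand-in encoders with different "visual biases" (in practice: CLIP, DINOv2, ...).
enc_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
enc_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))

adapter = MultiEncoderAdapter([enc_a, enc_b], [512, 768], lm_dim=4096)
tokens_for_lm = adapter(torch.randn(2, 3, 224, 224))
print(tokens_for_lm.shape)  # torch.Size([2, 4096])
```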
4M-21 is accepted at #NeurIPS2024! 🎉 Code, trained models and live demo available at 4m.epfl.ch Big congrats to the amazing team @roman__bachmann, @dmizrahi_, @aligarjani, @mingfei_gao, David Griffiths, @hujm99, @afshin_dn, @zamir_ar.
Check out 4M-21, our newest any-to-any vision model that has strong out-of-the-box vision, generation and retrieval capabilities! Code and models are available at 4m.epfl.ch 🎉
🎉 BRAVE is accepted as oral at ECCV 2024! We focused on the visual capabilities of VLMs and proposed an efficient ensembling mechanism to boost them. See the project page for a quick summary and paper: brave-vlms.epfl.ch Congrats to my amazing team! #ECCV2024 @eccvconf
Thanks @mervenoyann for sharing our work and for the help with demo! 🥳🙌 Check out 4M-21 demo at huggingface.co/spaces/EPFL-VI…
4M is a multimodal training framework introduced by Apple and EPFL machinelearning.apple.com/research/massi… The resulting model takes image and text as input and outputs image and text 🤩 Models: huggingface.co/collections/EP… Demo: huggingface.co/spaces/EPFL-VI… Paper: huggingface.co/papers/2406.09…