Oğuzhan Fatih Kar
@oguzhanthefatih
Machine Learning Researcher at @Apple. CS PhD @EPFL_en on multimodal foundation models. Previously @Google, @METU_ODTU, @aselsan.
Happy to share that I’ve successfully defended my PhD thesis, “Scaling the Modalities in Multimodal Foundation Models”! 🎓 🎉 A huge thanks to my incredible advisor @zamir_ar and all the amazing collaborators I’ve had the chance to work with across EPFL, Apple, and Google.
Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what…
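As a rough illustration of the idea (not the paper's actual method; the documents, benchmark examples, and TF-IDF scoring below are made up purely for demonstration), one could rank candidate pretraining documents by how closely they resemble examples from the target benchmarks:

```python
# Toy illustration (not the paper's method): rank candidate pretraining
# documents by their similarity to examples from a target benchmark.
# All data below is made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target_examples = [
    "What is the capital of France? Paris.",
    "Solve: 12 * 7 = 84.",
]
candidate_docs = [
    "Paris is the capital and largest city of France.",
    "The recipe calls for two cups of flour and one egg.",
    "Multiplication tables: 12 x 7 equals 84, 12 x 8 equals 96.",
]

vec = TfidfVectorizer().fit(target_examples + candidate_docs)
sims = cosine_similarity(vec.transform(candidate_docs),
                         vec.transform(target_examples)).max(axis=1)

# Keep the documents that look most like the target tasks.
for score, doc in sorted(zip(sims, candidate_docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```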
We will present FlexTok at #ICML2025 on Tuesday! Drop by to chat with @JRAllardice and me if you're interested in tokenization, flexible ways to encode images, and generative modeling. 📆 Tue, Jul 15, 16:30 PDT 📍 East Exhibition Hall, Poster E-3010 🌐 flextok.epfl.ch
Have you ever been bothered by the constraints of fixed-sized 2D-grid tokenizers? We present FlexTok, a flexible-length 1D tokenizer that enables autoregressive models to describe images in a coarse-to-fine manner. flextok.epfl.ch arxiv.org/abs/2502.13967 🧵 1/n
Final Takeaways 📌 Multimodal foundation models are impressive generalists, but they still lag behind vision specialists. 📌 They perform better on semantic tasks (e.g., classification, segmentation) than on geometric ones (depth, surface normals). 📌 Among the non-reasoning models, GPT-4o…
🚀 Excited to share our latest work on evaluating the visual capabilities of leading multimodal foundation models! 🔗 Code & prompt chains are open-sourced: fm-vision-evals.epfl.ch Big shoutout to our intern @rahul_ramach for leading the work, and to the whole team!
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress;…
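A minimal sketch of the prompt-chaining idea, assuming the OpenAI Python client; this is not the open-sourced prompt chains from fm-vision-evals.epfl.ch, just an illustration of splitting a classification query into steps a chat model can follow:

```python
# Illustrative two-step prompt chain for image classification with a
# multimodal chat model. The chain design here is a simplified stand-in,
# not the released evaluation code.
from openai import OpenAI

client = OpenAI()

def classify(image_url: str, labels: list[str]) -> str:
    # Step 1: ask the model to describe the image contents.
    desc = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "List the main objects in this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    ).choices[0].message.content

    # Step 2: map the description to one of the candidate labels.
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Objects: {desc}\nPick the single best label from: "
            f"{', '.join(labels)}. Answer with the label only."}],
    ).choices[0].message.content
    return answer.strip()

print(classify("https://example.com/cat.jpg", ["tabby cat", "golden retriever"]))
```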
We open-sourced the codebase of FlexTok. FlexTok is an image tokenizer that produces flexible-length token sequences and represents image content in a compressed, coarse-to-fine way. Like in PCA: the 1st token captures the most compressed representation of the image, the 2nd…
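The PCA analogy can be made concrete with a small, self-contained example (an analogy only, not the FlexTok code): reconstructing an image from a growing prefix of its SVD components shows the same coarse-to-fine behavior, where the first component already captures the broad structure and later ones add detail.

```python
# Coarse-to-fine reconstruction from a prefix of SVD components,
# as an analogy for decoding from a prefix of 1D tokens.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in grayscale image

U, S, Vt = np.linalg.svd(image, full_matrices=False)

for k in (1, 4, 16, 64):  # like decoding from 1, 4, 16, 64 tokens
    approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    err = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"components kept: {k:3d}  relative error: {err:.3f}")
```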
Happy to share that I’ve joined @Apple as a Machine Learning Researcher! I’m looking forward to working on exciting projects in multimodal foundation models with an amazing team.
Happy to share that we released FlexTok code and models on github.com/apple/ml-flext…. Try them with our interactive @huggingface demo on huggingface.co/spaces/EPFL-VI…
Excited to share that we have recently released the source code for FlexTok, bringing a fresh perspective to tokenization. Code on GitHub: lnkd.in/g4iNJFmU. Project Page: flextok.epfl.ch #FlexTok #Tokenization #MachineLearning #MLResearch #OpenSource #AI
Check out our recent work on flexible image tokenization📷🗜️, led by amazing folks @roman__bachmann, @JRAllardice, @dmizrahi_🎉👏 Stay tuned for code, weights, live demo and more! flextok.epfl.ch
Our recent work on multimodal AI is featured on the @EPFL_en front page! 🎉🚀
Researchers from our school have developed 4M, a next-generation, open-sourced framework for training versatile and scalable multimodal foundation models that go beyond language.💡🚀 go.epfl.ch/lrD-en
Happening today! 👀 If you'd like to discuss any-to-any multimodal models, tokenization, and scaling, come join @oguzhanthefatih, @zamir_ar, and me at poster 3709 in East Exhibit Hall A-C at 11am-2pm PST.
We are going to present 4M-21 next week at #NeurIPS2024 in Vancouver 🇨🇦. Come chat with us (@roman__bachmann, @zamir_ar and myself) if you are interested in multimodal foundation models! 📅 Thu 12 Dec 11 a.m. PST Poster 3709. 🌐 4m.epfl.ch with a live demo 📸
We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks and modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the…
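For readers unfamiliar with the setup, here is a toy sketch of the underlying "everything becomes discrete tokens" idea (schematic only, not the released ml-4m code): each modality gets its own tokenizer and its own slice of a shared vocabulary, so a single sequence model can consume and predict any mix of modalities.

```python
# Schematic illustration of modality-specific tokenizers sharing one vocabulary.
import numpy as np

VOCAB_PER_MODALITY = 1024
MODALITIES = ["rgb", "depth", "caption"]

def tokenize(modality: str, values: np.ndarray) -> np.ndarray:
    """Quantize values to integers and shift them into the modality's vocab range."""
    offset = MODALITIES.index(modality) * VOCAB_PER_MODALITY
    ids = np.clip((values * (VOCAB_PER_MODALITY - 1)).astype(int),
                  0, VOCAB_PER_MODALITY - 1)
    return ids + offset

rng = np.random.default_rng(0)
rgb_tokens = tokenize("rgb", rng.random(16))      # e.g., 16 image tokens
depth_tokens = tokenize("depth", rng.random(16))  # 16 depth tokens

# An any-to-any model is trained to predict, say, depth_tokens given
# rgb_tokens (and vice versa), all within one shared token space.
sequence = np.concatenate([rgb_tokens, depth_tokens])
print(sequence[:8], "...", sequence[-8:])
```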
I am happy to be recognized as a top reviewer for #NeurIPS2024! 🎉 neurips.cc/Conferences/20…
Thanks @mervenoyann for sharing our work 🤗 see you all Thursday afternoon at #ECCV2024 to discuss VLMs!
One of my favorite vision language models is now BRAVE 🦁 (@oguzhanthefatih et al) Very simply put, BRAVE investigates using multiple pre-trained vision encoders. But what makes it different?
We will present BRAVE🦁 next week at #ECCV2024 in Milan! 🇮🇹 Come chat with us: 📅 Oral Presentation 6C: Vision and Other Modalities: 🗓 Thu, 3 Oct, 1:30 p.m. – 3:30 p.m. CEST 📅 Poster Session 6, 190: 🗓 Thu, 3 Oct, 4:30 p.m. – 6:30 p.m. CEST 🌐brave-vlms.epfl.ch
We introduce BRAVE🦁 to broaden the visual capabilities of VLMs by leveraging diverse visual biases, enabling strong performance on several captioning & VQA tasks. Joint work w/ Alessio, Petra, Ace, @zamir_ar, @fedassa as part of my @Google internship. brave-vlms.epfl.ch
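A generic sketch of the multi-encoder idea (the stand-in encoders and the simple concatenate-then-project fusion below are assumptions for illustration; BRAVE's actual fusion module differs, see the paper): features from several pre-trained vision encoders with different visual biases are combined and projected into the language model's embedding space.

```python
# Generic multi-encoder adapter: combine features from several vision
# encoders and project them into an LM's embedding space. Stand-ins only.
import torch
import torch.nn as nn

class MultiEncoderAdapter(nn.Module):
    def __init__(self, encoders: list[nn.Module], feat_dims: list[int], lm_dim: int):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        # Project the concatenated features into the LM's embedding space.
        self.proj = nn.Linear(sum(feat_dims), lm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = [enc(image) for enc in self.encoders]  # each: (B, D_i)
        return self.proj(torch.cat(feats, dim=-1))     # (B, lm_dim)

# Stand-in encoders with different "visual biases" (in practice: CLIP, DINOv2, ...).
enc_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
enc_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))

adapter = MultiEncoderAdapter([enc_a, enc_b], [512, 768], lm_dim=4096)
tokens_for_lm = adapter(torch.randn(2, 3, 224, 224))
print(tokens_for_lm.shape)  # torch.Size([2, 4096])
```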
4M-21 is accepted at #NeurIPS2024! 🎉 Code, trained models and live demo available at 4m.epfl.ch Big congrats to the amazing team @roman__bachmann, @dmizrahi_, @aligarjani, @mingfei_gao, David Griffiths, @hujm99, @afshin_dn, @zamir_ar.
Check out 4M-21, our newest any-to-any vision model that has strong out-of-the-box vision, generation and retrieval capabilities! Code and models are available at 4m.epfl.ch 🎉
🎉 BRAVE is accepted as oral at ECCV 2024! We focused on the visual capabilities of VLMs and proposed an efficient ensembling mechanism to boost them. See the project page for a quick summary and paper: brave-vlms.epfl.ch Congrats to my amazing team! #ECCV2024 @eccvconf
Thanks @mervenoyann for sharing our work and for the help with demo! 🥳🙌 Check out 4M-21 demo at huggingface.co/spaces/EPFL-VI…
4M is a multimodal training framework introduced by Apple and EPFL machinelearning.apple.com/research/massi… The resulting model takes image and text as input and outputs image and text 🤩 Models: huggingface.co/collections/EP… Demo: huggingface.co/spaces/EPFL-VI… Paper: huggingface.co/papers/2406.09…