Alaa El-Nouby
@alaa_nouby
Research Scientist at @Apple. Previous: @Meta (FAIR), @Inria, @MSFTResearch, @VectorInst and @UofG
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔 Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding. github.com/apple/ml-aim (🧵)
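For intuition, here is a minimal sketch of what autoregressive pre-training over image patches can look like: a causal Transformer predicts each patch's pixel values from the patches before it. This is an illustrative toy, not the AIMv2 implementation; all module names, shapes, and the pixel-regression loss below are assumptions.

```python
# Toy sketch (not AIMv2): causal Transformer that predicts the next image patch.
import torch
import torch.nn as nn

class CausalPatchAR(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, dim=512, depth=6, heads=8, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)            # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_dim)             # predict next patch's pixels

    def forward(self, patches):                           # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        x = self.embed(patches) + self.pos[:, :N]
        mask = nn.Transformer.generate_square_subsequent_mask(N).to(patches.device)
        h = self.encoder(x, mask=mask)                    # causal attention over patches
        pred = self.head(h[:, :-1])                       # predictions for patches 2..N
        return nn.functional.mse_loss(pred, patches[:, 1:])
```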
Feed the Children of Gaza, no child should be starving.
Consistent with its historic commitment to a just and lasting peace in the Middle East, I have decided that France will recognize the State of Palestine. I will make this solemn announcement before the United Nations General Assembly this coming September.…
we missed a banger paper in the grok4/k2 drop noise, guys. these guys: > look for optimal ways to select data mixes to get max improvement on a model given a target domain > do multimodal validation > show good extrapolation accuracy (testing on 1.4B and predicting on 8B)
If you are attending ICML today, consider checking out Samara’s poster on the role of sparsity in MoEs at 11 AM PDT. Poster ID: E-2810
🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute? We explored this through the lens of MoEs:
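As a toy illustration of the parameters-vs-compute axis, here is a minimal top-k MoE layer: adding experts grows the parameter count, while each token still activates only k experts, so per-token compute stays roughly flat. Purely illustrative and not tied to any paper's implementation; all names and sizes are assumptions.

```python
# Sketch of a top-k MoE feed-forward layer: more experts = more parameters,
# but only k experts run per token, so per-token compute stays roughly fixed.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim=512, hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # each token visits only k experts
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out
```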
Deciding which data mixture to use has always been such a crucial part of nailing a good pre-training recipe. Check out this paper, led by @PierreAblin, @MustafaShukor1, and the team at Apple MLR, providing a principled way to select optimal data mixture weights!
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
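The general recipe could be sketched roughly as follows: fit a parametric loss model on small-scale runs over (model size, mixture weights), then pick the mixture that minimizes the predicted loss at the target scale. The functional form, measurements, and numbers below are illustrative assumptions, not the paper's actual law or data.

```python
# Illustrative only: fit an assumed loss law L(N, w) on toy small-scale runs,
# then extrapolate to find the mixture weight minimizing predicted loss at a
# much larger model size.
import numpy as np
from scipy.optimize import curve_fit, minimize

# Toy small-scale measurements: (model params N, weight of domain A) -> loss
runs_N = np.array([1e8, 1e8, 1e8, 3e8, 3e8, 3e8])
runs_w = np.array([0.2, 0.5, 0.8, 0.2, 0.5, 0.8])
runs_L = np.array([3.10, 2.95, 3.05, 2.90, 2.74, 2.86])

def law(X, c, alpha, a, b):
    N, w = X
    return c * N ** (-alpha) + a * (w - b) ** 2    # assumed parametric form

params, _ = curve_fit(law, (runs_N, runs_w), runs_L,
                      p0=[10.0, 0.1, 1.0, 0.5], maxfev=10000)

# Extrapolate: best mixture weight for a much larger model
N_target = 8e9
res = minimize(lambda w: law((N_target, w[0]), *params),
               x0=[0.5], bounds=[(0.0, 1.0)])
print("predicted optimal weight for domain A:", res.x[0])
```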
@CVPR is around the corner!! Join us at the Workshop on T4V at #CVPR2025 with a great speaker lineup (@MikeShou1, @jw2yang4ai, @WenhuChen, @roeiherzig, Yuheng Li, Kristen Grauman) covering diverse topics! Website: sites.google.com/view/t4v-cvpr2… #CVPR #Transformer #Vision #T4V2025 #T4V
The Worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics PAPER: arxiv.org/abs/2506.01844
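A rough sketch of the asynchronous-inference idea (hypothetical `policy` and `robot` interfaces, not the SmolVLA/LeRobot API): the robot executes the current action chunk while the next chunk is already being predicted in the background, so control does not stall on model latency.

```python
# Hypothetical interfaces: policy(observation) -> list of actions,
# robot.observe() -> observation, robot.step(action) -> executes one action.
from concurrent.futures import ThreadPoolExecutor

def control_loop(policy, robot, steps=100):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(policy, robot.observe())      # first prediction
        for _ in range(steps):
            chunk = future.result()                        # action chunk to run now
            future = pool.submit(policy, robot.observe())  # next prediction, in background
            for action in chunk:                           # executes while the model runs
                robot.step(action)
```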
Hello World: My team at FAIR / @metaai (AI Research Agent) is looking to hire contractors across software engineering and ML. If you are interested and based in the UK, please fill in the following short EoI form: docs.google.com/forms/d/e/1FAI…
Me and the Egyptian King 👑 best player in the world - 47 G/As, totally unreal season. Let me know if you ever fancy a game of online chess! 😀 @MoSalah
I don't have to tell you what happened to these three boys. You already know. How awful is that?
Proud to report that TarFlow is accepted to #ICML2025 as a Spotlight 🎉 I’m really looking forward to new ideas and applications enabled by powerful Normalizing Flow models 🚀
We attempted to make Normalizing Flows work really well, and we are happy to report our findings in the paper arxiv.org/pdf/2412.06329 and the code github.com/apple/ml-tarfl…. [1/n]
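For readers new to flows, a minimal sketch of the normalizing-flow training objective: exact maximum likelihood via the change-of-variables formula. The code uses a generic affine coupling layer for brevity; TarFlow itself is a Transformer-based autoregressive flow, so treat this only as an illustration of the objective.

```python
# Generic normalizing-flow sketch (not TarFlow's architecture): affine coupling
# layers trained with the exact log-likelihood under a standard Gaussian prior.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        # maps the first half of the features to a scale and shift for the second half
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):                         # x: (B, dim)
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                         # keep the scale well-behaved
        z2 = x2 * s.exp() + t
        logdet = s.sum(dim=-1)                    # log|det Jacobian| of this layer
        return torch.cat([x1, z2], dim=-1), logdet

def nll(layers, x):
    z, logdet = x, x.new_zeros(x.shape[0])
    for layer in layers:
        z, ld = layer(z)
        logdet = logdet + ld
        z = z.flip(dims=[-1])                     # permute so both halves get transformed
    log_prior = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -(log_prior + logdet).mean()           # exact negative log-likelihood
```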
I’ve been curious about how early- vs late-fusion multimodal approaches compare under controlled conditions. Great to see this studied in depth. Turns out, optimal late-fusion models have a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from…
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
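For reference, a schematic contrast between the two setups (illustrative modules and shapes, not the paper's code): early fusion feeds image and text tokens into one shared Transformer from the first layer, while late fusion first runs a modality-specific vision encoder and only then hands its features to the language model.

```python
# Schematic early-fusion vs late-fusion models; all dimensions are illustrative.
import torch
import torch.nn as nn

dim, heads, depth = 512, 8, 4
block = lambda: nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), depth
)

class EarlyFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)    # raw patches -> tokens
        self.text_embed = nn.Embedding(32000, dim)
        self.trunk = block()                      # one shared trunk sees both modalities

    def forward(self, patches, text_ids):
        tokens = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        return self.trunk(tokens)

class LateFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)
        self.vision_encoder = block()             # modality-specific parameters
        self.text_embed = nn.Embedding(32000, dim)
        self.language_model = block()

    def forward(self, patches, text_ids):
        vision = self.vision_encoder(self.patch_embed(patches))
        tokens = torch.cat([vision, self.text_embed(text_ids)], dim=1)
        return self.language_model(tokens)
```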
We have been thinking a lot about how to train truly native multimodal models: (1) what arch to use (early-fusion, late-fusion, MoEs)? (2) the impact of data mixtures (interleaved, img-cap, text data) We took a stab at answering these questions (and more) in this preprint ...
Excited to see further studies into early fusion vs late fusion models, in particular a great analysis of multimodal MoEs, aligned with our findings in MoMa on designing parameter specialization in multimodal LLMs. A few key things that helped us on top of the results presented…
Apple just broke the scaling laws for image models. Imagine creating Ghibli art, but 10x faster.
Mustafa keeps releasing multimodal bangers