Mustafa Shukor
@MustafaShukor1
CS PhD @Sorbonne_Univ_
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
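For context, here is a minimal sketch of the early- vs. late-fusion distinction the study compares (illustrative PyTorch code, not the paper's actual architectures; hidden sizes, depths, and class names are made up): early fusion feeds projected image patches and text tokens into one shared transformer from the first layer, while late fusion encodes images with a separate vision tower and only merges its outputs into the language model afterwards.

```python
# Illustrative sketch of early vs. late fusion for image+text models.
import torch
import torch.nn as nn

D = 256  # shared hidden size (hypothetical)

class EarlyFusion(nn.Module):
    """One shared trunk sees image and text tokens from layer 0."""
    def __init__(self, vocab=1000, patch_dim=768, depth=4):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab, D)
        self.img_proj = nn.Linear(patch_dim, D)  # light projection only, no vision tower
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)

    def forward(self, text_ids, patches):
        tokens = torch.cat([self.img_proj(patches), self.txt_embed(text_ids)], dim=1)
        return self.trunk(tokens)  # modalities interact from the first layer

class LateFusion(nn.Module):
    """A separate vision encoder runs first; fusion happens only in the LM."""
    def __init__(self, vocab=1000, patch_dim=768, depth=4):
        super().__init__()
        vis_layer = nn.TransformerEncoderLayer(patch_dim, nhead=4, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(vis_layer, depth)  # dedicated tower
        self.img_proj = nn.Linear(patch_dim, D)
        self.txt_embed = nn.Embedding(vocab, D)
        lm_layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(lm_layer, depth)

    def forward(self, text_ids, patches):
        vis = self.img_proj(self.vision_encoder(patches))  # images encoded alone first
        tokens = torch.cat([vis, self.txt_embed(text_ids)], dim=1)
        return self.lm(tokens)  # fusion only after the vision tower

text_ids = torch.randint(0, 1000, (2, 16))
patches = torch.randn(2, 49, 768)
print(EarlyFusion()(text_ids, patches).shape, LateFusion()(text_ids, patches).shape)
```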

we missed a banger paper in the grok4/k2 drop noise, guys. these guys:
> look for optimal ways to select data mixes to get max improvement on a model given a target domain
> do multimodal validation
> show good extrapolation accuracy (testing at 1.4B and predicting at 8B)
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
i love this kind of empirical research - i always ask about data mixtures bc i'm curious about what works and why, so here we have some insights!
Deciding which data mixture to use has always been such a crucial part of nailing a good pre-training recipe. Check out this paper, led by @PierreAblin, @MustafaShukor1 and the team at Apple MLR, providing a principled way to select optimal data mixture weights!
Data mixing ratios are critical for modern LLM training. This work takes a first principles approach and develops scaling laws for the mixing ratios, enabling “train small” -> “get guarantees at scale”. Definitely worth a read.
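To make the "train small → extrapolate to scale" recipe concrete, here is a minimal sketch (illustrative only; the functional form, numbers, and variable names are assumptions, not the paper's actual law): fit a parametric loss model on cheap small-scale runs that vary model size and mixture weights, then choose the mixture that minimizes the predicted loss at the target scale.

```python
# Fit a toy mixture-aware scaling law on small runs, then pick the mixture
# that the fitted law predicts is best at a much larger model size.
import numpy as np
from scipy.optimize import curve_fit, minimize

def loss_model(X, E, alpha, b0, b1):
    """Assumed form: L = E + exp(b0*w1 + b1*w2) / N**alpha (illustrative only)."""
    N, w1, w2 = X
    return E + np.exp(b0 * w1 + b1 * w2) / N**alpha

# Cheap small-scale runs: model size, first two mixture weights (third is 1-w1-w2),
# and measured target-domain loss. All numbers are made up for illustration.
runs_N = np.array([50e6, 50e6, 150e6, 150e6, 400e6, 400e6, 1.4e9, 1.4e9])
runs_w = np.array([[0.6, 0.3], [0.2, 0.5],
                   [0.6, 0.3], [0.2, 0.5],
                   [0.4, 0.4], [0.1, 0.3],
                   [0.5, 0.2], [0.3, 0.6]])
runs_L = np.array([3.10, 3.25, 2.85, 2.98, 2.70, 2.90, 2.45, 2.61])

X = (runs_N, runs_w[:, 0], runs_w[:, 1])
params, _ = curve_fit(loss_model, X, runs_L, p0=[2.0, 0.2, 0.0, 0.0], maxfev=50000)

# Extrapolate: minimize the *predicted* loss over mixtures at the target scale.
N_big = 8e9

def predicted(w):
    return loss_model((N_big, w[0], w[1]), *params)

res = minimize(predicted, x0=[0.4, 0.3], bounds=[(0.0, 1.0)] * 2,
               constraints={"type": "ineq", "fun": lambda w: 1.0 - w[0] - w[1]})
w_opt = np.array([res.x[0], res.x[1], 1.0 - res.x.sum()])
print("predicted-optimal mixture at 8B params:", w_opt.round(3))
```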
Hugging Face presents SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Blog post: huggingface.co/blog/smolvla
The Worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
In the short term it's hard to foresee the implications of this, but when you look at everything that companies like @huggingface and @nvidia are doing to bring AI into the physical world, any doubt about why robotics is the next great revolution disappears…
Robotics models are increasingly bulky and difficult to run directly on robots. With @RemiCadene and the team at @LeRobotHF and @huggingface, we're changing that. Introducing SmolVLA, a sub-500M VLA designed for efficient training and inference. A thread 🧵
SmolVLA is our compact neural network for robotics. It trains faster, runs quicker, and reaches higher success rates than other foundation models out there. Even better, it was trained on open-source community data only! Thanks @danaaubakir @MustafaShukor1 @_fracapuano among others…
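On the asynchronous inference stack: the idea, sketched below in illustrative code that is not the LeRobot API (all names like `predict_chunk` and `read_observation` are hypothetical), is to decouple action execution from policy inference, so the robot keeps executing actions from the current predicted chunk while the next chunk is being computed instead of stalling on model latency.

```python
# Toy sketch of asynchronous inference: a worker thread refills a queue of
# predicted action chunks while the control loop executes without blocking.
import queue
import threading
import time

CHUNK = 5        # actions per predicted chunk (hypothetical)
CTRL_DT = 0.02   # 50 Hz control loop

def predict_chunk(obs):
    """Stand-in for a (slow) policy call that returns a chunk of actions."""
    time.sleep(0.06)  # pretend inference takes ~3 control steps
    return [f"action(obs={obs}, t={k})" for k in range(CHUNK)]

def read_observation(step):
    return f"obs_{step}"

action_q = queue.Queue()

def inference_worker(stop_event):
    step = 0
    while not stop_event.is_set():
        if action_q.qsize() < CHUNK // 2:  # request a new chunk before running dry
            for a in predict_chunk(read_observation(step)):
                action_q.put(a)
        step += 1
        time.sleep(CTRL_DT)

stop = threading.Event()
threading.Thread(target=inference_worker, args=(stop,), daemon=True).start()

# Control loop: execute whatever action is available without waiting on the model.
last_action = "hold"
for t in range(30):
    try:
        last_action = action_q.get_nowait()
    except queue.Empty:
        pass  # no fresh action yet: keep executing the last one
    print(f"t={t:02d} executing {last_action}")
    time.sleep(CTRL_DT)
stop.set()
```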