Mihir Prabhudesai (@mihirp98)

Pinned

🚨 The era of infinite internet data is ending. So we ask:

👉 What’s the right generative modelling objective when data, not compute, is the bottleneck?

TL;DR:

▶️ Compute-constrained? Train autoregressive models.

▶️ Data-constrained? Train diffusion models.

Get ready for 🤿 1/n



Mihir Prabhudesai@mihirp98 · 10 h

This looks like pretty amazing work, especially the data repetition plots.

Quoting the pinned post by Mihir Prabhudesai@mihirp98 · 22 h


Mihir Prabhudesai@mihirp98 · 9 h

Great work from Mihir with lots of nice insights in the thread!


Mihir Prabhudesai@mihirp98 · 7 h

Very nice results here, we've seen similar ability of diffusion models to benefit from repeated data, more so than AR models. The diffusion loss is much noisier so might act as a natural regularizer to prevent overfitting.


Mihir Prabhudesai@mihirp98 · 13 h

Interesting results, and an easy-to-follow thread for the TL;DR.


Mihir Prabhudesai Retweeted

You Jiacheng@YouJiacheng · 17 h

I re-investigated the LR. It seems that 2e-4 is pretty safe for your model family because the models are much smaller than those in the NeurIPS 2023 paper or Chinchilla. It's just a bit conservative lol. Anyway, 3.55 vs. 3.71 is pretty significant, so I guess optimal HPs won't change it.


Mihir Prabhudesai@mihirp98 · 16 h

Very impactful work: it tells us the crossover DOES exist. Big for our community.


Mihir Prabhudesai@mihirp98 · 16 h

Really interesting paper, with insight into how language models might progress if we approach bottlenecks in scaling laws.


Mihir Prabhudesai@mihirp98 · 18 h

We had some early evidence of this when over-training MDLM baselines in the original paper, but we didn't have time to explore it then. Glad to see discrete diffusion scales up so well in data-constrained settings!


Mihir Prabhudesai@mihirp98 · 18 h

Great to see this tribal knowledge on more solid ground! We’ve seen this in several settings and I expect this to inform what we (in the general sense) choose to scale!


Mihir Prabhudesai@mihirp98 · 19 h

Exciting analysis on diffusion vs. auto-regressive modeling, with a surprisingly clean takeaway! Hope to see similar analyses in more sequential-decision-making-y / robotics-y problems.


Mihir Prabhudesai@mihirp98 · 19 h

Really interesting paper. Fits the theme that we should make our modeling problems harder, not easier, so that models are forced to learn more and generalize better.


Mihir Prabhudesai@mihirp98 · 19 h

Was actually wondering with @hyundongleee about the fundamental differences between diffusion and autoregressive modeling, other than the structure imposed in modeling the sequential conditional distribution, and how they manifest. A poignant paper that addresses this thought.


Mihir Prabhudesai@mihirp98 · 19 h

Thrilled to finally release this study! 🚀 We view (discrete) diffusion models as implicitly doing data augmentation over autoregressive. Through this lens, we find that diffusion outperforms AR in data-constrained settings, but it requires larger models and way more epochs to…
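One way to picture the "implicit data augmentation" view is the toy sketch below. It is not the authors' training code: it assumes a masked (absorbing-state) discrete diffusion setup in which each training step samples a mask ratio and hides tokens independently, and the names `ar_examples`, `masked_view`, and `MASK` are made up for illustration. Under those assumptions, an autoregressive model sees the same left-to-right prediction tasks on every epoch, while the masked objective sees a different corrupted view of the same sequence almost every time.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def ar_examples(tokens):
    """Left-to-right (prefix, next-token) pairs: one fixed set per sequence."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]


def masked_view(tokens, rng):
    """One corrupted copy: sample a mask ratio, then mask each token independently."""
    ratio = rng.random()
    return tuple(MASK if rng.random() < ratio else tok for tok in tokens)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# Repeating the data gives the AR objective exactly the same tasks again.
ar_epoch_1 = ar_examples(tokens)
ar_epoch_2 = ar_examples(tokens)

# Repeating the data gives the masked objective many distinct views.
diffusion_views = {masked_view(tokens, rng) for _ in range(200)}

print("AR tasks identical across epochs:", ar_epoch_1 == ar_epoch_2)
print("AR tasks per sequence:", len(ar_epoch_1))
print("distinct masked views seen in 200 passes:", len(diffusion_views))
```

For a sequence of length n there are up to 2^n mask patterns but only n left-to-right prediction tasks, which is one intuition for why repeated epochs might keep helping the masked objective for longer.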


Mihir Prabhudesai@mihirp98 · 19 h

Excellent insights on when to use autoregressive vs diffusion models


Mihir Prabhudesai@mihirp98 · 21 h

Great research work. The thread is a gold mine for anyone interested in understanding diffusion language modelling and how it fares with AR models!


Mihir Prabhudesai@mihirp98 · 21 h

Great work from great people! @mihirp98 @pathak2206 AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)


Mihir Prabhudesai@mihirp98 · 22 h

Excited to share this project! Models keep growing, but data doesn’t. So we ask: is there a more data-efficient way to achieve great performance? Diffusion is our answer for the first step of exploration.
