Dan Busbridge @ ICML Vancouver 🇨🇦
@danbusbridge
Machine Learning Research @ Apple (opinions are my own)
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 arxiv.org/abs/2502.08606
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!

Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
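As a rough picture of the fit-small-then-extrapolate workflow described above, here is a toy sketch only: the power-law form and every number below are synthetic placeholders, not the paper's parameterization.

```python
# Fit a simple scaling law to small-scale runs, then extrapolate to a larger
# scale. Form and data are illustrative placeholders.
import numpy as np

N_small = np.array([1e7, 3e7, 1e8, 3e8])     # small-scale training budgets (made up)
L_small = np.array([3.9, 3.4, 3.0, 2.65])    # observed losses (made up)

# Fit log L = intercept + slope * log N, i.e. L ≈ exp(intercept) * N^slope.
slope, intercept = np.polyfit(np.log(N_small), np.log(L_small), 1)
predict = lambda N: np.exp(intercept) * N ** slope   # slope is negative for a decaying law

print("extrapolated loss at N = 1e10:", predict(1e10))
```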
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo — exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A 🔗 icml.cc/virtual/2025/4…
Stop by poster #596 from 10AM to 12:30PM tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs for Sigmoid Attention and a 1:1 Softmax Attention (trained with a deterministic dataloader for 1T tokens): -…
Small update on SigmoidAttn (arXiv incoming).
- 1B and 7B LLM results added and stabilized.
- Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm is used with Grok-1, for example.
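For concreteness, a minimal single-head sketch of that hybrid-norm residual, assuming a LayerNorm over the embedding dimension; head splitting and the sequence-length bias of the full SigmoidAttn recipe are omitted.

```python
# x + norm(sigmoid(Q K^T / sqrt(d_qk)) V), with the norm over the embed dim.
import math
import torch
import torch.nn as nn

class SigmoidAttnHybridNorm(nn.Module):
    def __init__(self, d_model: int, d_qk: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_qk)
        self.k = nn.Linear(d_model, d_qk)
        self.v = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)   # norm on embed dim, not seq dim
        self.d_qk = d_qk

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.sigmoid(q @ k.transpose(-2, -1) / math.sqrt(self.d_qk))
        return x + self.norm(attn @ v)      # hybrid-norm residual

x = torch.randn(2, 4096, 512)
print(SigmoidAttnHybridNorm(512, 64)(x).shape)   # torch.Size([2, 4096, 512])
```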
I’ve been curious about how early- vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from…
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
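As a conceptual sketch of the distinction, under the common framing rather than the paper's exact architectures: early fusion feeds interleaved image and text tokens to one shared transformer, while late fusion runs the image through a separate vision encoder before merging.

```python
# Illustrative early- vs late-fusion wiring; dimensions and depths are arbitrary.
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randn(2, 32, d)    # (batch, seq, dim) text embeddings
image_tokens = torch.randn(2, 64, d)   # patch embeddings

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
vision = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

# Early fusion: concatenate raw token streams; one model sees everything.
early = shared(torch.cat([image_tokens, text_tokens], dim=1))

# Late fusion: a dedicated vision encoder first, then merge with text.
late = shared(torch.cat([vision(image_tokens), text_tokens], dim=1))
print(early.shape, late.shape)
```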
Parameterized Transforms 🚀 Here is a new tool that provides a modular and extendable implementation of torchvision-based image augmentations, with access to their parameterization. [1/5]
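To illustrate why exposing the parameterization matters, a conceptual sketch using plain torchvision (not the Parameterized Transforms API itself): standard transforms sample their parameters internally and discard them, whereas sampling and application can be separated so the parameters stay visible.

```python
# Conceptual illustration with vanilla torchvision; not the tool's own API.
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as F

img = torch.rand(3, 224, 224)            # dummy image tensor
crop = T.RandomResizedCrop(size=160)

# Standard usage: crop parameters are sampled internally and thrown away.
out = crop(img)

# "Parameterized" usage: sample the parameters explicitly, keep them, then apply.
i, j, h, w = T.RandomResizedCrop.get_params(img, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3))
out = F.resized_crop(img, i, j, h, w, size=[160, 160])
print("crop params:", (i, j, h, w))       # now available for logging or reuse
```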