Sayak Paul
@RisingSayak
ML at Hugging Face 🤗
Had the honor of presenting diffusion transformers at CS25, Stanford. The place is truly magical. Slides: bit.ly/dit-cs25 Recording: youtu.be/vXtapCFctTI?si… Thanks to @stevenyfeng for making it happen!
Fast LoRA inference for Flux with Diffusers and PEFT 🚨 There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their…
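For context, a minimal sketch of the serving pattern the post builds on: loading a Flux LoRA through Diffusers' PEFT backend and compiling the denoiser so repeated requests stay fast. The repo id and adapter name are placeholders, and the post's real focus, avoiding recompilation when hot-swapping LoRAs, isn't shown here.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Load a LoRA via the PEFT backend; "my-flux-lora" is a placeholder repo id.
pipe.load_lora_weights("my-flux-lora", adapter_name="style")

# Compile the denoiser once; later calls reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

image = pipe(
    "a photo of a corgi wearing sunglasses",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("corgi.png")
```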

Hardware design dictates the runtime performance of models, yet it's still not discussed much in the context of diffusion models. Or does anyone know of a few such discussions already? P.S.: I know of the scaling papers in diffusion, and SD3's is the best one among them!

How can you maximize performance with torch.compile when working with Diffusers across different use cases? This blog shows how torch.compile can deliver significant speedups, even when using offloading and LoRAs. 🔗 Read here: hubs.la/Q03xKQTq0 From: @RisingSayak…
We show how `torch.compile` support is being deepened in Diffusers while discussing:
1. Reducing cold-start time with regional compilation
2. Making compilation work with offloading, quantization, and LoRA (see the sketch below)
3. Operationalizing compilation-related features
4. Practical…
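As a rough illustration of points 1 and 2, assuming a recent Diffusers/PyTorch stack (the blog's exact code may differ): compile only the repeated transformer blocks, so identical block classes share one compiled artifact, and combine that with model CPU offloading. Recent Diffusers versions also expose a `compile_repeated_blocks()` helper for the same idea.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Offloading: submodules move to the GPU only while they run.
pipe.enable_model_cpu_offload()

# Regional compilation: compiling each repeated block in place (instead of
# the whole transformer) cuts cold-start compile time, since blocks of the
# same class reuse one compiled artifact.
for block in pipe.transformer.transformer_blocks:
    block.compile(fullgraph=True)
for block in pipe.transformer.single_transformer_blocks:
    block.compile(fullgraph=True)

image = pipe("a watercolor hummingbird", num_inference_steps=28).images[0]
```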

Veo3 is a bit too good to trust. God! An astronaut riding a bicycle in the streets of the Cotswolds, England.
We overhauled and simplified Diffusers' benchmarking suite to report just the forward-pass numbers of popular diffusion models -- Flux, SDXL, Wan, LTX. This makes sense because the forward pass is the most computationally intensive part of the iterative denoising process. So, any improvement to it will…
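The suite itself lives in the Diffusers repo; the core pattern is just timing the denoiser's forward pass in isolation with CUDA events. A minimal sketch with a hypothetical `benchmark_forward` helper:

```python
import torch

@torch.no_grad()
def benchmark_forward(model, inputs, warmup=5, iters=20):
    """Median latency in ms of model(**inputs) -- the forward pass only."""
    # Warm up so lazy init / autotuning doesn't pollute the measurement.
    for _ in range(warmup):
        model(**inputs)
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(**inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return sorted(times)[len(times) // 2]
```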

Users of `torch.compile`: some small performance tips (see the sketch after this list):
1. Default to `fullgraph=True` to catch graph breaks as early as possible.
2. Check for recompilation triggers. Put your code under the `torch._dynamo.config.patch(error_on_recompile=True)` context.
3. Use regional compilation…
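Tips 1 and 2 on a toy module; with `error_on_recompile=True`, Dynamo raises instead of silently recompiling:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()

# Tip 1: fullgraph=True turns any graph break into a hard error up front.
compiled = torch.compile(model, fullgraph=True)

# Tip 2: surface recompilation triggers (e.g. a changed input shape) as errors.
with torch._dynamo.config.patch(error_on_recompile=True):
    x = torch.randn(8, 1024, device="cuda")
    compiled(x)  # first call compiles
    compiled(x)  # same shape: served from cache, no recompile
    # compiled(torch.randn(16, 1024, device="cuda"))  # new shape -> would raise
```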
Thanks to @adyaman's great contribution, `flux-fast` is now supported on AMD chips too 🔥 Same recipe, (almost) same code, and it just works 🦋 Jam here 🎸 github.com/huggingface/fl…
Make Flux go brrr on H100s without bells and whistles ⚡️ We're excited to share a simple recipe, dubbed `flux-fast`, that delivers a 2.5x speedup on H100 GPUs. Kontext is also supported 🔥 Code: github.com/huggingface/fl… By Joel Schlosser (@PyTorch) & yours truly 🤗
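The full recipe is in the repo and goes further (e.g. fused projections, quantization, attention kernels); a stripped-down sketch of its spirit:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Channels-last tends to help the conv-heavy VAE decoder.
pipe.vae.to(memory_format=torch.channels_last)

# max-autotune searches for the fastest kernels at the cost of a longer
# first run; fullgraph=True guarantees no graph breaks slipped in.
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)

image = pipe("a macro shot of dew on a spiderweb", num_inference_steps=28).images[0]
```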