Sayak Paul
@RisingSayak
ML at Hugging Face 🤗
Had the honor of presenting diffusion transformers at CS25, Stanford. The place is truly magical. Slides: bit.ly/dit-cs25 Recording: youtu.be/vXtapCFctTI?si… Thanks to @stevenyfeng for making it happen!
Fast LoRA inference for Flux with Diffusers and PEFT 🚨 There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their…
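For context, a minimal sketch of the serving pattern the post builds on: loading a Flux LoRA through Diffusers' PEFT backend and compiling the denoiser so repeated requests stay fast. The repo id and adapter name are placeholders, and the post's real focus, avoiding recompilation when hot-swapping LoRAs, isn't shown here.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Load a LoRA via the PEFT backend; "my-flux-lora" is a placeholder repo id.
pipe.load_lora_weights("my-flux-lora", adapter_name="style")

# Compile the denoiser once; later calls reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

image = pipe(
    "a photo of a corgi wearing sunglasses",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("corgi.png")
```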

Hardware design dictates the runtime performance of models, yet it's still not discussed much in the context of diffusion models. Or does anyone know of a few such discussions already? P.S.: I know of the scaling papers in diffusion, and SD3's is the best one among them!

How can you maximize performance with torch.compile when working with Diffusers across different use cases? This blog shows how torch.compile can deliver significant speedups, even when using offloading and LoRAs. 🔗 Read here: hubs.la/Q03xKQTq0 From: @RisingSayak…
We show how `torch.compile` support is being deepened in Diffusers while discussing:
1. Reducing cold-start time with regional compilation
2. Making compilation work with offloading, quantization, and LoRA (see the sketch below)
3. Operationalizing compilation-related features
4. Practical…
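As a rough illustration of points 1 and 2, assuming a recent Diffusers/PyTorch stack (the blog's exact code may differ): compile only the repeated transformer blocks, so identical block classes share one compiled artifact, and combine that with model CPU offloading. Recent Diffusers versions also expose a `compile_repeated_blocks()` helper for the same idea.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Offloading: submodules move to the GPU only while they run.
pipe.enable_model_cpu_offload()

# Regional compilation: compiling each repeated block in place (instead of
# the whole transformer) cuts cold-start compile time, since blocks of the
# same class reuse one compiled artifact.
for block in pipe.transformer.transformer_blocks:
    block.compile(fullgraph=True)
for block in pipe.transformer.single_transformer_blocks:
    block.compile(fullgraph=True)

image = pipe("a watercolor hummingbird", num_inference_steps=28).images[0]
```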

Veo3 is a bit too good to trust. God! An astronaut riding a bicycle in the streets of the Cotswolds, England.
We overhauled and simplified Diffusers' benchmarking suite to report just the forward-pass numbers of popular diffusion models -- Flux, SDXL, Wan, LTX. This makes sense because the forward pass is the most computationally intensive part of the iterative denoising process. So, any improvement to it will…
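The suite itself lives in the Diffusers repo; the core pattern is just timing the denoiser's forward pass in isolation with CUDA events. A minimal sketch with a hypothetical `benchmark_forward` helper:

```python
import torch

@torch.no_grad()
def benchmark_forward(model, inputs, warmup=5, iters=20):
    """Median latency in ms of model(**inputs) -- the forward pass only."""
    # Warm up so lazy init / autotuning doesn't pollute the measurement.
    for _ in range(warmup):
        model(**inputs)
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(**inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return sorted(times)[len(times) // 2]
```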

Users of `torch.compile`: some small performance tips (see the sketch after this list):
1. Default to `fullgraph=True` to catch graph breaks as early as possible.
2. Check for recompilation triggers. Put your code under the `torch._dynamo.config.patch(error_on_recompile=True)` context.
3. Use regional compilation…
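Tips 1 and 2 on a toy module; with `error_on_recompile=True`, Dynamo raises instead of silently recompiling:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()

# Tip 1: fullgraph=True turns any graph break into a hard error up front.
compiled = torch.compile(model, fullgraph=True)

# Tip 2: surface recompilation triggers (e.g. a changed input shape) as errors.
with torch._dynamo.config.patch(error_on_recompile=True):
    x = torch.randn(8, 1024, device="cuda")
    compiled(x)  # first call compiles
    compiled(x)  # same shape: served from cache, no recompile
    # compiled(torch.randn(16, 1024, device="cuda"))  # new shape -> would raise
```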
Thanks to @adyaman's great contribution, `flux-fast` is now supported on AMD chips too 🔥 Same recipe, (almost) same code, and it just works 🦋 Jam here 🎸 github.com/huggingface/fl…
Make Flux go brrr on H100s without bells and whistles ⚡️ We're excited to share a simple recipe, dubbed `flux-fast`, that delivers a 2.5x speedup on H100 GPUs. Kontext is also supported 🔥 Code: github.com/huggingface/fl… By Joel Schlosser (@PyTorch) & yours truly 🤗
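The full recipe is in the repo and goes further (e.g. fused projections, quantization, attention kernels); a stripped-down sketch of its spirit:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Channels-last tends to help the conv-heavy VAE decoder.
pipe.vae.to(memory_format=torch.channels_last)

# max-autotune searches for the fastest kernels at the cost of a longer
# first run; fullgraph=True guarantees no graph breaks slipped in.
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)

image = pipe("a macro shot of dew on a spiderweb", num_inference_steps=28).images[0]
```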