Gaurav Parmar
@GauravTParmar
PhD @ CMU
[1/4] Ever wondered what it would be like to use images—rather than text—to generate object and background compositions? We introduce VisualComposer, a method for compositional image generation with object-level visual prompts.
🚨 The era of infinite internet data is ending, so we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️Compute-constrained? Train Autoregressive models ▶️Data-constrained? Train Diffusion models Get ready for 🤿 1/n
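For reference, the two training objectives being contrasted are, in their textbook forms (notation mine, not from the thread):

```latex
% Autoregressive (next-token) objective: maximize the log-likelihood of each
% token given its prefix.
\mathcal{L}_{\mathrm{AR}} = -\,\mathbb{E}_{x}\left[\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)\right]

% Diffusion (denoising) objective: predict the noise added to a clean sample
% at a randomly drawn noise level t.
\mathcal{L}_{\mathrm{Diff}} = \mathbb{E}_{x,\;\epsilon \sim \mathcal{N}(0, I),\; t}
\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\right)\right\rVert^2
```

The thread's claim is about which of these makes better use of a fixed budget: AR when compute is the limit, diffusion when unique data is.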
🚀 Career Update After years pushing the boundaries of Generative AI at some of the world’s top companies -> I’m going startup. I’ve joined @DecartAI as a founding team member, leading the charge to build our San Francisco office from the ground up. decart.ai
Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
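A minimal sketch of that training-time rollout as I read it (toy model, hypothetical names; not the authors' code): generate each frame from the model's own previous output while reusing a KV cache, then apply the loss to the whole rollout so training sees the same process as inference.

```python
import torch
import torch.nn as nn

class ToyFramePredictor(nn.Module):
    """Stand-in for an autoregressive video diffusion transformer (illustrative only)."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, prev_frame, past_kv=None):
        # A real model would attend over `past_kv`; here we just carry it along.
        past_kv = [] if past_kv is None else past_kv
        past_kv = past_kv + [prev_frame.detach()]   # grow the "KV cache"
        return torch.tanh(self.proj(prev_frame)), past_kv

def self_forcing_rollout(model, first_frame, target_video, num_frames):
    """Unroll the model on its OWN predictions (as at inference), reusing the cache,
    then compute the loss on the rollout so gradients see the unrolled process."""
    kv_cache, frames = None, [first_frame]
    for _ in range(1, num_frames):
        pred, kv_cache = model(frames[-1], past_kv=kv_cache)
        frames.append(pred)
    rollout = torch.stack(frames, dim=1)            # (batch, frames, dim)
    return torch.mean((rollout - target_video) ** 2)

model = ToyFramePredictor()
x0 = torch.randn(2, 16)                             # first frame latent
target = torch.randn(2, 8, 16)                      # ground-truth video latents
loss = self_forcing_rollout(model, x0, target, num_frames=8)
loss.backward()
```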
🚀 How to run 12B FLUX.1 on your local laptop with 2-3× speedup? Come check out our #SVDQuant (#ICLR2025 Spotlight) poster session! 🎉 🗓️ When: Friday, Apr 25, 10–12:30 (Singapore time) 📍 Where: Hall 3 + Hall 2B, Poster 169 📌 Poster: tinyurl.com/poster-svdquant 🎮 Demo:…
🚀 The 4-bit era has arrived! Meet #SVDQuant, our new W4A4 quantization paradigm for diffusion models. Now, 12B FLUX can run on a 16GB 4090 laptop without offloading—with 3x speedups over W4A16 models (like NF4) while maintaining top-tier image quality. #AI #Quantization. 1/7
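For intuition only, here is what plain W4A4 (4-bit weights, 4-bit activations) looks like numerically. This is a naive per-tensor scheme, not SVDQuant itself, which builds on top of 4-bit quantization to preserve image quality.

```python
import torch

def quantize_4bit(x):
    """Symmetric 4-bit quantization to the integer range [-8, 7].
    Per-tensor scale for simplicity; real kernels use finer granularity."""
    scale = x.abs().amax() / 7.0 + 1e-8
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

def w4a4_linear(x, w):
    """Emulated W4A4 matmul: quantize activations and weights, multiply, rescale."""
    qx, sx = quantize_4bit(x)
    qw, sw = quantize_4bit(w)
    return (qx @ qw.t()) * (sx * sw)       # integer-like matmul, then dequantize

x = torch.randn(4, 64)
w = torch.randn(128, 64)
y_fp = x @ w.t()
y_q = w4a4_linear(x, w)
print((y_fp - y_q).abs().mean())           # error of the naive scheme
```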
This is really cool
Decentralized Diffusion Models power stronger models trained on more accessible infrastructure. DDMs mitigate the networking bottleneck that locks training into expensive and power-hungry centralized clusters. They scale gracefully to billions of parameters and generate…
Text prompts have shaped how we compose images with foundation models. But what if we could simply inject Visual Prompts instead? We introduce 🌟Visual Composer🌟 which achieves high-fidelity compositions of subjects and backgrounds with visual prompts! snap-research.github.io/visual-compose…
One of the motivating applications of this project was to emulate a "photo album" experience. With VisualComposer, you can create image variations from one image. But it also became a more general tool: you can not only generate image variations, but also compose any visual…
HandsOnVLM: An in-context action prediction assistant for daily activities. It enables predicting future interaction trajectories of human hands in a scene given natural language queries. Evaluations across 100s of diverse scenarios in homes, offices, and outdoors! 1/n
Current vision systems use fixed-length representations for all images. In contrast, human intelligence and LLMs (e.g., OpenAI o1) adjust their compute budget based on the input. Since different images demand different processing & memory, how can we enable vision systems to be adaptive? 🧵
As a founding researcher, I have seen @SkildAI grow exponentially. We changed offices 3 times, grew 10x in human (and robot) numbers, and became a unicorn in less than a year. If you want to scale up robotics and work with a cracked team of engineers and scientists, come to @SkildAI.
Thrilled to announce @SkildAI! Over the past year, @gupta_abhinav_ and I have been working with our top-tier team to build an AI foundation model grounded in the physical world. Today, we’re taking Skild AI out of stealth with $300M in Series A funding: forbes.com/sites/rashishr…
The latent space of earlier generative models like GANs can linearly encode concepts of the data. What if the data were model weights? We present weights2weights, a subspace in diffusion weights that behaves as an interpretable latent space over customized diffusion models.
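A rough sketch of the underlying idea with toy data (plain PCA over flattened weights; the paper's actual pipeline is more involved, and all names here are illustrative): treat each customized model's weights as a data point, fit a linear subspace, and edit by moving along its directions.

```python
import numpy as np

# Toy stand-in: N "customized models", each flattened to a D-dim weight vector.
rng = np.random.default_rng(0)
N, D, K = 200, 1024, 16
W = rng.normal(size=(N, D))

# Fit a K-dim linear subspace over model weights via PCA (SVD of centered data).
mean = W.mean(axis=0)
U, S, Vt = np.linalg.svd(W - mean, full_matrices=False)
basis = Vt[:K]                         # K principal directions in weight space

# "Latent code" of one customized model, and an edit along a subspace direction.
z = (W[0] - mean) @ basis.T            # project weights into the subspace
z_edited = z + 2.0 * np.eye(K)[3]      # move along the 4th direction
w_edited = mean + z_edited @ basis     # map back to a full weight vector
```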
WALT3D has been accepted as an Oral at #CVPR (top 90 out of 12,000)! WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion Project Page: cs.cmu.edu/~walt3d Key Idea: Convert your image to 3D under severe occlusions
Our new inversion method facilitates interactive image editing with few-step diffusion models 🏃♀️🏃 I played with it all morning, so much fun -- less than 2 sec per edit 😲 Try the demo! Project page: garibida.github.io/ReNoise-Invers… Cool demo: huggingface.co/spaces/garibid…
Introducing ReNoise Inversion! With the recent diffusion models trained to generate images with a few steps, interactive image editing is within our reach. Our method unlocks interactive image editing by inverting images to the noise space of fast diffusion models 🚀
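My loose reading of how such an inversion can work, sketched with a simplified DDIM-style scheduler and a dummy noise predictor (illustrative, not the authors' implementation): reverse each sampling step of the few-step model, refining the noise estimate at every step with a few fixed-point iterations.

```python
import torch

@torch.no_grad()
def renoise_style_inversion(eps_model, x0, alphas_bar, renoise_iters=3):
    """Invert a clean latent x0 back to the model's noise space by reversing each
    sampling step; the inner loop re-estimates eps at the (noisier) target point."""
    x = x0
    for i in range(len(alphas_bar) - 1):
        a_cur, a_next = alphas_bar[i], alphas_bar[i + 1]   # decreasing \bar{alpha}
        x_next = x
        for _ in range(renoise_iters):
            eps = eps_model(x_next, i + 1)                  # noise prediction at the target step
            x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
            x_next = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
        x = x_next
    return x                                                # latent in the noise space

# Tiny runnable example with a dummy noise predictor.
eps_model = lambda x, t: 0.1 * torch.ones_like(x)
alphas_bar = torch.tensor([0.99, 0.7, 0.3, 0.05])
z = renoise_style_inversion(eps_model, torch.randn(1, 4), alphas_bar)
```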
Testing the new pix2pix-Turbo in real time, a very interesting GAN architecture that leverages the SD-Turbo model. Here I'm using the edge2image LoRA with single-step inference 🤯
[1/2] We’ve released the code for #pix2pixturbo and #CycleGANTurbo. These conditional GANs adapt a text-to-image model such as SD-Turbo for both paired and unpaired image translation in a single step (0.11 sec on A100 and 0.29 sec on A6000). Try our code and the…
One-Step Image Translation with Text-to-Image Models In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning.
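For context on what a single step means in practice, here is plain one-step generation with the SD-Turbo backbone through diffusers. This is the base model the paper adapts, not the released pix2pix-Turbo/CycleGAN-Turbo code.

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SD-Turbo (the distilled backbone) and run a single denoising step.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of a red barn in a snowy field",
    num_inference_steps=1,      # one-step generation
    guidance_scale=0.0,         # SD-Turbo is trained to run without CFG
).images[0]
image.save("sd_turbo_one_step.png")
```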