Perry Zhang
@PY_Z001
PhD student at UCSD CSE. Working on video generation architecture.
Don’t be distracted by RL; focus on developing a better gradient estimator for E_p[f(X)]?
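For context on that formula: the classic baseline for estimating ∇θ E_{p_θ}[f(X)] is the score-function (REINFORCE) estimator, ∇θ E_{p_θ}[f(X)] = E_{p_θ}[f(X) ∇θ log p_θ(X)]. A minimal PyTorch sketch; the Gaussian distribution and the choice of f are illustrative assumptions, not from the tweet:

```python
import torch

# Score-function (REINFORCE) estimator for grad_theta E_{p_theta}[f(X)].
# Illustrative setup (assumed): p_theta = Normal(theta, 1), f(x) = x**2.
theta = torch.tensor(0.5, requires_grad=True)

def f(x):
    return x ** 2  # arbitrary test function

dist = torch.distributions.Normal(theta, 1.0)
x = dist.sample((10_000,))          # samples carry no gradient
log_prob = dist.log_prob(x)         # differentiable w.r.t. theta

# Surrogate loss whose gradient equals the REINFORCE estimate:
# mean of f(X) * d/dtheta log p_theta(X)
surrogate = (f(x).detach() * log_prob).mean()
surrogate.backward()

# For Normal(theta, 1): E[X^2] = theta^2 + 1, so the true gradient is
# 2 * theta = 1.0. theta.grad should match up to Monte Carlo noise.
print(theta.grad)
```

The notorious weakness of this estimator is its variance, which is exactly why "a better gradient estimator" is an interesting research target.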
I learned a lot from NATTEN!
Watch my talk about NATTEN on @GPU_MODE this Saturday at 3PM ET / noon PT. I'll go over all the exciting new features we shipped very recently, especially our Hopper and Blackwell FNA kernels, now speeding up video / world models by up to 2.6X e2e! youtube.com/watch?v=mF_H_J
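For readers new to NATTEN: neighborhood attention restricts each query to a local window of keys instead of the full sequence. Below is a pure-PyTorch sketch of the idea on a 1D sequence; it conveys the concept only and is not NATTEN's fused kernels or its actual API (this masked variant also differs at the edges, where NATTEN shifts the window so every query sees exactly `window` keys):

```python
import torch
import torch.nn.functional as F

def local_attention_1d(q, k, v, window=7):
    """Each token attends only to keys within window // 2 positions.
    Conceptual reference: real NATTEN kernels never materialize the
    full (L, L) score matrix."""
    L, d = q.shape
    scores = (q @ k.T) / d ** 0.5                     # (L, L)
    idx = torch.arange(L)
    mask = (idx[None, :] - idx[:, None]).abs() > window // 2
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 32)
out = local_attention_1d(q, k, v)
print(out.shape)  # torch.Size([16, 32])
```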
📣 We’ve had three papers accepted at #ICML2025, Hao-AI-Lab is sending @haozhangml to attend ICML in person😂! If you're around, please find Hao at the venue and chat with him about video diffusion, LLM agents, and efficient attention 👋🧠 🎬 Fast Video Generation with Sliding…
Heading to ICML next week (Monday - Thursday). Down to chat research, ideas, anything cool, or just hang 😄📍🎯
🚀 Attention is the bottleneck in video DiTs—5 s of 720p = 100K+ tokens, quadratic cost blows up fast. Sparse/linear attention is 🔑 for long-context world models. 🧠 Track relevant papers in our Awesome-Video-Attention repo → github.com/hao-ai-lab/Aws… #WorldModel #VideoAI
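Back-of-envelope math behind that "100K+ tokens" figure; the VAE compression factors and patch size below are typical of recent video DiTs and are assumptions here, not numbers from the tweet:

```python
# Rough token count / attention cost for a video DiT.
# Assumed: 8x spatial + 4x temporal VAE compression, 2x2 patchify, 24 fps.
frames = 5 * 24                      # 5 s of video
h, w = 720, 1280                     # 720p
lat_t = frames // 4                  # temporal compression -> 30
lat_h, lat_w = h // 8, w // 8        # spatial compression -> 90 x 160
tokens = lat_t * (lat_h // 2) * (lat_w // 2)   # 2x2 patch embedding
print(tokens)                        # 30 * 45 * 80 = 108,000 tokens

# Full self-attention scales quadratically in tokens:
print(f"{tokens ** 2:.2e} query-key pairs per attention layer")  # ~1.2e10
```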
🔧🤖 New wave of open-source LLMs like DeepSeek-R1-0528 and Qwen3-235B-A22B are leveling up with stronger agentic performance. We test them in head-to-head gameplay — the upgraded DeepSeek-R1-0528 outsmarts strong reasoning models like o4-mini across several games and it nearly…
I will be giving a talk in @GPU_MODE tomorrow (May 31 12pm PST) about FastVideo/STA/VSA. Come if you're interested! youtube.com/watch?v=x44iGp…

amazing!
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint…
Announcing FastVideo V1, a unified framework for accelerating video generation. FastVideo V1 offers:
- A simple, consistent Python API
- State-of-the-art model performance optimizations
- Optimized implementations of popular models
Blog: hao-ai-lab.github.io/blogs/fastvide…
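A minimal sketch of what "a simple, consistent Python API" might look like in practice. The `VideoGenerator` class, method names, and model id below are assumptions based on the announcement, not verified against the actual `fastvideo` package; check the blog and docs for the real entry points:

```python
# Hypothetical FastVideo-style usage; names are assumed, not confirmed.
from fastvideo import VideoGenerator  # assumed import path

generator = VideoGenerator.from_pretrained(
    "FastVideo/FastHunyuan-diffusers",  # assumed model id
    num_gpus=1,
)
video = generator.generate_video(
    "a corgi surfing a wave at sunset",
    num_frames=125,
)
```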
STA has been accepted to ICML 2025!!
🎥 Video DiTs are painfully slow: HunyuanVideo takes 16 min to generate a 5s 720P video on H100. 🤯 Announcing Sliding Tile Attention (STA):
* Accelerate 3D full attention (FA3) by up to 10x
* Slash the end-to-end time from 16 --> 5 mins
* NO extra training. NO quality loss!…
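The core idea, roughly: group tokens into tiles and let each query tile attend only to a local window of key tiles, so attention stays dense within contiguous blocks (GPU-friendly) while distant tiles are skipped. A simplified 1D mask construction to illustrate the concept; the real STA works on 3D video tokens and skips masked tiles at the kernel level rather than materializing any mask:

```python
import torch

def sliding_tile_mask(num_tokens, tile=4, window_tiles=3):
    """Boolean mask: token i may attend to token j iff their tiles lie
    within window_tiles // 2 of each other. 1D toy version of the
    sliding-tile idea (assumed parameters, not STA's actual config)."""
    tile_id = torch.arange(num_tokens) // tile
    dist = (tile_id[None, :] - tile_id[:, None]).abs()
    return dist <= window_tiles // 2

mask = sliding_tile_mask(16)
print(mask.int())  # block-banded pattern: dense tiles, sparse globally
```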
Thrilled to share recent research from our fascinating lab members and collaborators at #ICLR2025! 🚀✨ Come say hi in our poster sessions and dive into discussions on LLM agents, reasoning, long-context training, efficient inference, and more. We’re excited to share, learn and…
Let me tell a real story of my own with @nvidia. Back in 2014, I was a wide-eyed first-year PhD student at CMU in @ericxing's lab, trying to train AlexNet on CPU (don’t ask why). I had zero access to GPUs. NVIDIA wasn’t yet "THE NVIDIA" we know today—no DGXs, no trillion-dollar…
We are beyond honored and thrilled to welcome the amazing new @nvidia DGX B200 💚 at @HDSIUCSD @haoailab. This generous gift from @nvidia is an incredible recognition and an opportunity for the UCSD MLSys community and @haoailab to push the boundaries of AI + System research. 💪
😍😍😍😍😍
When Ilya Sutskever once explained why next-word prediction leads to intelligence, he made a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️♂️ Inspired by that idea, we turned to Ace…
sorry, but you don’t get to insult people at @AIatMeta after being the most closed-source AI company in the world for the last two years. Meta is literally THE company, along with @MistralAI, that stayed true to its open source commitment
#GTC25 featured our OSDI'24 work on disaggregated inference! 🚀🚀🚀 Who would have known a year ago that disaggregated prefill would become such a core technology in next-generation LLM inference! x.com/haoailab/statu…
Still optimizing throughput for LLM serving? Think again: goodput might be a better choice! Splitting prefill and decode onto different GPUs yields:
- up to 4.48x higher goodput
- up to 10.2x stricter latency criteria
Blog: hao-ai-lab.github.io/blogs/distserv… Paper: arxiv.org/abs/2401.09670
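Goodput here means requests per second that actually meet their latency SLOs (e.g., time-to-first-token and time-per-output-token), rather than raw request throughput. A toy calculation of the difference; the SLO thresholds and measurements below are made up for illustration:

```python
# Toy goodput vs throughput: count only requests meeting both SLOs.
TTFT_SLO = 0.2   # seconds to first token (assumed threshold)
TPOT_SLO = 0.05  # seconds per output token (assumed threshold)

requests = [  # (ttft_s, tpot_s) measured per request (fabricated data)
    (0.15, 0.04), (0.30, 0.04), (0.18, 0.06), (0.12, 0.03),
]
window_s = 2.0  # measurement window

throughput = len(requests) / window_s
goodput = sum(1 for ttft, tpot in requests
              if ttft <= TTFT_SLO and tpot <= TPOT_SLO) / window_s
print(f"throughput={throughput:.1f} req/s, goodput={goodput:.1f} req/s")
# -> throughput=2.0 req/s, goodput=1.0 req/s: half the served requests
#    violate an SLO, so optimizing raw throughput overstates quality.
```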
Beyond thrilled 🚀 to see my lab's work DistServe (OSDI'24) just got featured in Jensen Huang's keynote at Nvidia GTC! This marks our third major breakthrough in LLM inference after PagedAttention (vLLM) and Lookahead Decoding — pushing the frontier yet again! Since we post the…
Spotted @vllm_project during Jensen's Keynote @nvidia #GTC
Thank you @_akhaliq for sharing! Introducing our newest CVPR 2025 paper EgoLife: Towards Egocentric Life Assistant Homepage: egolife-ai.github.io Blog: egolife-ai.github.io/blog/ Code: github.com/EvolvingLMMs-L… EgoLife is a project focused on building AI-powered egocentric life…
EgoLife: Towards Egocentric Life Assistant introduces EgoLife, a project to develop an egocentric life assistant that accompanies users and enhances personal efficiency through AI-powered wearable glasses
Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario 🍄🌟? We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. 🤯 Claude-3.5 is also strong, but less capable of…