Luis Ceze
@luisceze
computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.
Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup over the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
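The trace schema lives in the flashinfer repo; as a rough feel for the kind of Gantt-style view the profiler produces, here is a minimal matplotlib sketch over a hypothetical list of (sm_id, start_ns, end_ns, label) events. The event names and format below are made up for illustration, not flashinfer's actual schema:

```python
# Hypothetical trace rows: (sm_id, start_ns, end_ns, label).
import matplotlib.pyplot as plt

events = [
    (0, 0, 400, "tensor core GEMM"),
    (0, 350, 700, "cuda core softmax"),   # note the overlap with the GEMM
    (1, 0, 650, "variable-length load"),
]

fig, ax = plt.subplots()
for sm, start, end, label in events:
    # One horizontal bar per event, stacked by SM/warpgroup id.
    ax.barh(y=sm, width=end - start, left=start, height=0.6, label=label)
ax.set_xlabel("time (ns)")
ax.set_ylabel("SM / warpgroup")
# De-duplicate repeated legend entries.
handles, labels = ax.get_legend_handles_labels()
ax.legend(dict(zip(labels, handles)).values(), dict(zip(labels, handles)).keys())
plt.show()
```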
Great to see @OctoAICloud only second to @GroqInc -- given our service runs on off-the-shelf cloud @nvidia hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.
Wanna know whether different LLM providers serve the same Llama 3.1 70B? I sure did! So I ran a quick eval, got some surprising results + open sourced my code 👇 Check out my comparison between @GroqInc @FireworksAI_HQ @OctoAICloud @DeepInfra and @togethercompute
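The real eval code is in the linked repo; as a sketch of how such a comparison can be wired up: most of these providers speak the OpenAI-compatible chat API, so a single client loop covers them. The base URLs and model ids below are assumptions to verify against each provider's docs:

```python
from openai import OpenAI

PROVIDERS = {
    "fireworks": ("https://api.fireworks.ai/inference/v1",
                  "accounts/fireworks/models/llama-v3p1-70b-instruct"),
    "together": ("https://api.together.xyz/v1",
                 "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"),
    # Groq, OctoAI, and DeepInfra follow the same pattern.
}

PROMPT = "Answer with just the number: 17 * 23 = ?"

for name, (base_url, model) in PROVIDERS.items():
    client = OpenAI(base_url=base_url, api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,  # pin sampling down so providers are comparable
    )
    print(name, resp.choices[0].message.content)
```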
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to…
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and…
Congrats to @ye_combinator @tqchenml @luisceze! Flashinfer has been the real power behind various inference frameworks! Hope to see more people joining the community and build your own inference engines on top of it!
🚀🎉
@0xA95 @seanprime7 @vinodg's work is finally out. Kick the tires and let them know what you think!
new JAX MPMD library from Nvidia
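The post doesn't name the library, so here is only a generic sketch of the MPMD pattern in plain JAX (not the NVIDIA library's API): different jitted programs pinned to different devices, with an explicit transfer between them.

```python
# Generic MPMD in plain JAX. Needs at least two visible devices.
import jax
import jax.numpy as jnp

devices = jax.devices()
assert len(devices) >= 2, "run with at least two devices"

@jax.jit
def producer(x):          # program A
    return jnp.sin(x) * 2.0

@jax.jit
def consumer(y):          # program B, a different computation
    return jnp.sum(y ** 2)

x = jax.device_put(jnp.arange(8.0), devices[0])  # commit input to device 0
y = producer(x)                                  # runs on device 0
y = jax.device_put(y, devices[1])                # explicit cross-device move
print(consumer(y))                               # runs on device 1
```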
LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow; flashinfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0…
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
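As a naive PyTorch sketch of the idea behind the sorting-free approach (not FlashInfer's kernel or API): draw from the full distribution, accept if the draw lands inside the top-p nucleus, and otherwise raise a pivot so equally unlikely tokens are masked out of later rounds.

```python
import torch

def top_p_sample_rejection(probs: torch.Tensor, top_p: float,
                           max_rounds: int = 32) -> int:
    pivot = 0.0
    for _ in range(max_rounds):
        masked = torch.where(probs > pivot, probs, torch.zeros_like(probs))
        token = int(torch.multinomial(masked / masked.sum(), 1))
        # In the nucleus iff strictly-more-probable tokens carry < top_p mass.
        if probs[probs > probs[token]].sum() < top_p:
            return token
        pivot = float(probs[token])  # rejected: anything this rare is out too
    return token  # give up after max_rounds and keep the last draw

probs = torch.softmax(torch.randn(32000), dim=-1)
print(top_p_sample_rejection(probs, top_p=0.9))
```

Since most of the probability mass typically sits on a handful of tokens, the expected number of rounds is small, and the O(V log V) sort over a 100K+ vocabulary disappears entirely.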
Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at mlsys.org/virtual/2025/p… and register today at mlsys.org/Register
Amazing to see Flashinfer’s traction in the short 8mo since it was first introduced. Try out the latest release.
We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfixes and…
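For context on the paged-attention part: a request's KV cache lives in scattered fixed-size pages, and attention runs over the gathered sequence. FlashInfer fuses the page lookup into the attention kernel itself; a naive PyTorch sketch of the same idea (names and shapes are illustrative) gathers pages first, then attends:

```python
import torch
import torch.nn.functional as F

num_pages, page_size, num_heads, head_dim = 8, 16, 4, 64
k_cache = torch.randn(num_pages, page_size, num_heads, head_dim)
v_cache = torch.randn_like(k_cache)

page_table = torch.tensor([5, 2, 7])    # this request's non-contiguous pages
q = torch.randn(num_heads, head_dim)    # single decode-step query

# Gather the request's KV from scattered pages into one sequence.
k = k_cache[page_table].reshape(-1, num_heads, head_dim)  # [seq_len, H, D]
v = v_cache[page_table].reshape(-1, num_heads, head_dim)

out = F.scaled_dot_product_attention(
    q.view(1, num_heads, 1, head_dim),   # [1, H, 1, D]
    k.permute(1, 0, 2).unsqueeze(0),     # [1, H, seq_len, D]
    v.permute(1, 0, 2).unsqueeze(0),
)                                        # -> [1, H, 1, D]
```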
Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I'm Brazilian :). As a computer scientist, reading TRIBAL by @MichaelMorrisCU makes me wonder about culture's impact on AI and its co-evolution with human culture.
📺Day 7: Fictional Characters and Real Change 📺 From Will & Grace to Brazilian telenovelas, widely watched dramas can precipitate dramatic cultural shifts. NGOs promoting public health changes have employed serial dramas to shift cultural ideals and personal decisions. But…
Huge achievement by the @AIatMeta team on launching the Llama 3.1 models! The quality benchmarks look incredible, our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: octo.ai/blog/llama-3-1…. 🙏🚀🐙
Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context…
More political deepfakes exist than you think, according to this AI expert. With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how: zdnet.com/article/ai-exp…
Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvind_uw and others! #mlsys24 arxiv.org/abs/2310.18547
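The core trick in the paper: the whole batch shares one base-weight GEMM, and each request only adds its own low-rank update. A naive PyTorch sketch of the computation the fused kernel replaces, with illustrative shapes:

```python
import torch

d_in, d_out, rank, n_adapters, batch = 1024, 1024, 16, 3, 5
W = torch.randn(d_in, d_out)              # shared base weight
A = torch.randn(n_adapters, d_in, rank)   # per-adapter LoRA A matrices
B = torch.randn(n_adapters, rank, d_out)  # per-adapter LoRA B matrices

x = torch.randn(batch, d_in)
adapter_id = torch.tensor([0, 2, 2, 1, 0])  # which adapter each request uses

y = x @ W                     # one dense GEMM amortized across the batch
for i in range(batch):        # the paper's kernel fuses this loop
    a = adapter_id[i]
    y[i] += (x[i] @ A[a]) @ B[a]   # low-rank update: cheap per request
```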

Great work Yilong, @cylinbao @ye_combinator @bariskasikci and team!
Atom: low-bit quantization for efficient and accurate LLM serving. #MLSys2024, bringing efficient and accurate 4-bit inference to serving scenarios.
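For a feel of the baseline technique, here is plain group-wise symmetric int4 weight quantization in PyTorch. This is only an illustrative sketch: Atom's actual scheme additionally quantizes activations, mixes precisions, and reorders outlier channels.

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(-1, group_size)
    # One scale per group, sized so the largest magnitude maps to 7.
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```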
#Llama3 🦙🦙 running fully locally on iPad without internet connection. Credits to @ruihanglai and the team
It is amazing how cheap we can go when it comes to running #Llama3 models from @AIatMeta, on a $100 Orange Pi.
Deploy #Llama3 on $100 Orange Pi with GPU acceleration through MLC LLM. Try it out on your Orange Pi 👉 blog.mlc.ai/2023/08/09/GPU…
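The Orange Pi build steps are in the linked post; once a model is compiled, driving it from Python looks roughly like the MLCEngine quick-start in the MLC LLM docs (the model id below is illustrative, and the API may have drifted since):

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style streaming chat completion.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is an Orange Pi?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```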
Fine-tuned open-source models are giving the AI giants a run for their money. @mattshumer_, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is…
Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all these benefits directly in your environment - ultra-fast inference, model orchestration, and optimized up/down the stack. 🚀🐙