Luis Ceze
@luisceze
computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.
Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup over the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
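The trace schema lives in the flashinfer repo; as a rough feel for the kind of Gantt-style view the profiler produces, here is a minimal matplotlib sketch over a hypothetical list of (sm_id, start_ns, end_ns, label) events. The event names and format below are made up for illustration, not flashinfer's actual schema:

```python
# Hypothetical trace rows: (sm_id, start_ns, end_ns, label).
import matplotlib.pyplot as plt

events = [
    (0, 0, 400, "tensor core GEMM"),
    (0, 350, 700, "cuda core softmax"),   # note the overlap with the GEMM
    (1, 0, 650, "variable-length load"),
]

fig, ax = plt.subplots()
for sm, start, end, label in events:
    # One horizontal bar per event, stacked by SM/warpgroup id.
    ax.barh(y=sm, width=end - start, left=start, height=0.6, label=label)
ax.set_xlabel("time (ns)")
ax.set_ylabel("SM / warpgroup")
# De-duplicate repeated legend entries.
handles, labels = ax.get_legend_handles_labels()
ax.legend(dict(zip(labels, handles)).values(), dict(zip(labels, handles)).keys())
plt.show()
```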
Great to see @OctoAICloud only second to @GroqInc -- given our service runs on off-the-shelf cloud @nvidia hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.
Wanna know whether different LLM providers serve the same Llama 3.1 70B? I sure did! So I ran a quick eval, got some surprising results + open sourced my code 👇 Check out my comparison between @GroqInc @FireworksAI_HQ @OctoAICloud @DeepInfra and @togethercompute
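The real eval code is in the linked repo; as a sketch of how such a comparison can be wired up: most of these providers speak the OpenAI-compatible chat API, so a single client loop covers them. The base URLs and model ids below are assumptions to verify against each provider's docs:

```python
from openai import OpenAI

PROVIDERS = {
    "fireworks": ("https://api.fireworks.ai/inference/v1",
                  "accounts/fireworks/models/llama-v3p1-70b-instruct"),
    "together": ("https://api.together.xyz/v1",
                 "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"),
    # Groq, OctoAI, and DeepInfra follow the same pattern.
}

PROMPT = "Answer with just the number: 17 * 23 = ?"

for name, (base_url, model) in PROVIDERS.items():
    client = OpenAI(base_url=base_url, api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,  # pin sampling down so providers are comparable
    )
    print(name, resp.choices[0].message.content)
```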
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to…
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and…
Congrats to @ye_combinator @tqchenml @luisceze! Flashinfer has been the real power behind various inference frameworks! Hope to see more people joining the community and build your own inference engines on top of it!
🚀🎉
@0xA95 @seanprime7 @vinodg's work is finally out. Kick the tires and let them know what you think!
new JAX MPMD library from Nvidia
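The post doesn't name the library, so here is only a generic sketch of the MPMD pattern in plain JAX (not the NVIDIA library's API): different jitted programs pinned to different devices, with an explicit transfer between them.

```python
# Generic MPMD in plain JAX. Needs at least two visible devices.
import jax
import jax.numpy as jnp

devices = jax.devices()
assert len(devices) >= 2, "run with at least two devices"

@jax.jit
def producer(x):          # program A
    return jnp.sin(x) * 2.0

@jax.jit
def consumer(y):          # program B, a different computation
    return jnp.sum(y ** 2)

x = jax.device_put(jnp.arange(8.0), devices[0])  # commit input to device 0
y = producer(x)                                  # runs on device 0
y = jax.device_put(y, devices[1])                # explicit cross-device move
print(consumer(y))                               # runs on device 1
```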
LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow; flashinfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0…
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
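As a naive PyTorch sketch of the idea behind the sorting-free approach (not FlashInfer's kernel or API): draw from the full distribution, accept if the draw lands inside the top-p nucleus, and otherwise raise a pivot so equally unlikely tokens are masked out of later rounds.

```python
import torch

def top_p_sample_rejection(probs: torch.Tensor, top_p: float,
                           max_rounds: int = 32) -> int:
    pivot = 0.0
    for _ in range(max_rounds):
        masked = torch.where(probs > pivot, probs, torch.zeros_like(probs))
        token = int(torch.multinomial(masked / masked.sum(), 1))
        # In the nucleus iff strictly-more-probable tokens carry < top_p mass.
        if probs[probs > probs[token]].sum() < top_p:
            return token
        pivot = float(probs[token])  # rejected: anything this rare is out too
    return token  # give up after max_rounds and keep the last draw

probs = torch.softmax(torch.randn(32000), dim=-1)
print(top_p_sample_rejection(probs, top_p=0.9))
```

Since most of the probability mass typically sits on a handful of tokens, the expected number of rounds is small, and the O(V log V) sort over a 100K+ vocabulary disappears entirely.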
Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at mlsys.org/virtual/2025/p… and register today at mlsys.org/Register
Amazing to see Flashinfer’s traction in the short 8mo since it was first introduced. Try out the latest release.
We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfixes and…
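For context on the paged-attention part: a request's KV cache lives in scattered fixed-size pages, and attention runs over the gathered sequence. FlashInfer fuses the page lookup into the attention kernel itself; a naive PyTorch sketch of the same idea (names and shapes are illustrative) gathers pages first, then attends:

```python
import torch
import torch.nn.functional as F

num_pages, page_size, num_heads, head_dim = 8, 16, 4, 64
k_cache = torch.randn(num_pages, page_size, num_heads, head_dim)
v_cache = torch.randn_like(k_cache)

page_table = torch.tensor([5, 2, 7])    # this request's non-contiguous pages
q = torch.randn(num_heads, head_dim)    # single decode-step query

# Gather the request's KV from scattered pages into one sequence.
k = k_cache[page_table].reshape(-1, num_heads, head_dim)  # [seq_len, H, D]
v = v_cache[page_table].reshape(-1, num_heads, head_dim)

out = F.scaled_dot_product_attention(
    q.view(1, num_heads, 1, head_dim),   # [1, H, 1, D]
    k.permute(1, 0, 2).unsqueeze(0),     # [1, H, seq_len, D]
    v.permute(1, 0, 2).unsqueeze(0),
)                                        # -> [1, H, 1, D]
```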
Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I'm Brazilian :). As a computer scientist, reading TRIBAL by @MichaelMorrisCU makes me wonder about culture's impact on AI and its co-evolution with human culture.
📺Day 7: Fictional Characters and Real Change 📺 From Will & Grace to Brazilian telenovelas, widely watched dramas can precipitate dramatic cultural shifts. NGOs promoting public health changes have employed serial dramas to shift cultural ideals and personal decisions. But…
Huge achievement by the @AIatMeta team on launching the Llama 3.1 models! The quality benchmarks look incredible, our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: octo.ai/blog/llama-3-1…. 🙏🚀🐙
Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context…
More political deepfakes exist than you think, according to this AI expert. With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how: zdnet.com/article/ai-exp…
Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvind_uw and others! #mlsys24 arxiv.org/abs/2310.18547
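The core trick in the paper: the whole batch shares one base-weight GEMM, and each request only adds its own low-rank update. A naive PyTorch sketch of the computation the fused kernel replaces, with illustrative shapes:

```python
import torch

d_in, d_out, rank, n_adapters, batch = 1024, 1024, 16, 3, 5
W = torch.randn(d_in, d_out)              # shared base weight
A = torch.randn(n_adapters, d_in, rank)   # per-adapter LoRA A matrices
B = torch.randn(n_adapters, rank, d_out)  # per-adapter LoRA B matrices

x = torch.randn(batch, d_in)
adapter_id = torch.tensor([0, 2, 2, 1, 0])  # which adapter each request uses

y = x @ W                     # one dense GEMM amortized across the batch
for i in range(batch):        # the paper's kernel fuses this loop
    a = adapter_id[i]
    y[i] += (x[i] @ A[a]) @ B[a]   # low-rank update: cheap per request
```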

Great work Yilong, @cylinbao @ye_combinator @bariskasikci and team!
Atom: low-bit quantization for efficient and accurate LLM serving. #MLSys2024, bringing efficient and accurate 4-bit inference to serving scenarios.
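For a feel of the baseline technique, here is plain group-wise symmetric int4 weight quantization in PyTorch. This is only an illustrative sketch: Atom's actual scheme additionally quantizes activations, mixes precisions, and reorders outlier channels.

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(-1, group_size)
    # One scale per group, sized so the largest magnitude maps to 7.
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```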
#Llama3 🦙🦙 running fully locally on iPad without internet connection. Credits to @ruihanglai and the team
It is amazing how cheap we can go when it comes to running #Llama3 models from @AIatMeta, on a $100 Orange Pi.
Deploy #Llama3 on $100 Orange Pi with GPU acceleration through MLC LLM. Try it out on your Orange Pi 👉 blog.mlc.ai/2023/08/09/GPU…
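The Orange Pi build steps are in the linked post; once a model is compiled, driving it from Python looks roughly like the MLCEngine quick-start in the MLC LLM docs (the model id below is illustrative, and the API may have drifted since):

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style streaming chat completion.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is an Orange Pi?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```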
Fine-tuned open-source models are giving the AI giants a run for their money. @mattshumer_, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is…
Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all these benefits directly in your environment - ultra-fast inference, model orchestration, and optimized up/down the stack. 🚀🐙