LMSYS Org
@lmsysorg
Large Model Systems Organization. Join our Slack (https://slack.sglang.ai). We developed SGLang (https://sglang.ai), Chatbot Arena (now @lmarena_ai), and Vicuna!
🚀 Breaking: SGLang provides the first open-source implementation to serve @deepseek_ai V3/R1 models with large-scale expert parallelism and prefill-decode disaggregation on 96 GPUs. It nearly matches the throughput reported by the official DeepSeek blog, achieving 52.3K input…
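For readers who want to poke at a deployment like this, here is a minimal client sketch. It assumes an SGLang server is already running locally and exposing its OpenAI-compatible API on port 30000; the launch flags in the comment are illustrative and may differ across SGLang versions.

```python
# Minimal client sketch. Assumes an SGLang deployment is already up and serving
# the OpenAI-compatible API, e.g. launched with (flags illustrative, version-dependent):
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
#       --tp 8 --trust-remote-code --port 30000
from openai import OpenAI

# SGLang exposes an OpenAI-compatible endpoint; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user",
               "content": "Explain prefill-decode disaggregation in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```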

Kudos to NVIDIA's Pen Li for his incredible support in our collaboration with SGLang: from B200 and GB200 NVL72 systems to a 128-GPU H200 Kubernetes cluster, he has continuously helped the SGLang team gain access, making large-scale LLM serving on cutting-edge hardware possible! 🤗
Proud to support this lightning-fast launch ⚡️ Accelerated through #NVIDIADGX Cloud and in partnership with Moonshot AI, @SGLang, and @Oracle Open Model Engine, we helped bring Kimi K2 to customers just days after its debut. Now, organizations can “Think Smart” and scale MoE…
🚨SGLang Summer Fest Bonus Drop🚨 Proud to share a joint effort from Mooncake by @Kimi_Moonshot, @Oracle, and SGLang: a trillion-parameter-scale Kimi K2 deployment running on 128 H200 GPUs sponsored by @NVIDIAAIDev DGX Cloud. OME + SGLang = MoE inference at production scale. 👇
✅ We’re excited to support @Alibaba_Qwen’s Qwen3-Coder on SGLang! With the tool-call parser and expert parallelism enabled, it runs smoothly across flexible configurations. Just give it a try (a short client sketch follows the quoted announcement below)! 🔗 github.com/zhaochenyang20…
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
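To try the tool-call parser mentioned above, a minimal sketch follows. The parser name in the launch comment and the run_tests tool are assumptions for illustration, not confirmed details from the announcement.

```python
# Tool-calling sketch against an SGLang server hosting Qwen3-Coder.
# Assumes the server was launched with a tool-call parser enabled, e.g.
# (the parser value is an assumption; check your SGLang version):
#   python -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B-Instruct \
#       --tool-call-parser qwen25 --tp 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, for illustration only
        "description": "Run the project's test suite and return the results.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    messages=[{"role": "user", "content": "Run the tests under tests/unit."}],
    tools=tools,
)
# With the parser enabled, the structured call lands in tool_calls
# instead of raw text.
print(resp.choices[0].message.tool_calls)
```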
This work was done by my PhD student Zijian and researchers at NVIDIA! @LigengZhu @songhan_mit Big congrats! We are also integrating VLM support into AReaL.
🚀Summer Fest Day 4: Turbocharging Vision-Language Models with SGLang + NVILA. 4.4× throughput, 2.2× faster response time! We've integrated NVILA into SGLang, enabling high-performance, scalable serving of vision-language models. This unlocks a 4.4× TPS boost and significantly…
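To show what serving a vision-language model like NVILA looks like from the client side, here is a hedged sketch using the OpenAI-compatible API. The model id and image URL are placeholders, not confirmed names from the announcement.

```python
# Vision-language request sketch. Assumes an SGLang server is serving an NVILA
# checkpoint (model id below is illustrative) behind the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Efficient-Large-Model/NVILA-8B",  # illustrative model id
    messages=[{
        "role": "user",
        "content": [
            # Mixed content: one image plus a text instruction.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```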
NVILA is available in SGLang👏🏻
🚀Summer Fest Day 3: Cost-Effective MoE Inference on CPU from the Intel PyTorch team. Deploying 671B DeepSeek R1 with zero GPUs? SGLang now supports high-performance CPU-only inference on Intel Xeon 6, enabling billion-parameter-scale MoE models like DeepSeek to run on commodity CPU servers…
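A minimal sketch of what querying a CPU-only deployment could look like. The --device cpu launch flag in the comment is an assumption about the CPU backend, so consult the SGLang docs for the exact Xeon setup; the /generate endpoint shown is SGLang's native API.

```python
# CPU-serving sketch. Assumes an SGLang server launched in CPU mode, e.g.
# (the --device cpu flag is an assumption; see the SGLang CPU backend docs):
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
#       --device cpu --trust-remote-code
import requests

# SGLang's native /generate endpoint takes a prompt plus sampling parameters.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 16, "temperature": 0},
}
# Generous timeout: CPU decoding of a large MoE model is slow.
resp = requests.post("http://localhost:30000/generate", json=payload, timeout=600)
print(resp.json()["text"])
```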
Congrats to Mingfei and the team! Thanks to the excellent collaboration between @lmsysorg and Intel! #intel #sglang #xeon