vLLM
@vllm_project
A high-throughput and memory-efficient inference and serving engine for LLMs. Join http://slack.vllm.ai to discuss together with the community!
vLLM is finally addressing a long-standing problem: startup time. 35s -> 2s for CUDA graph capture is a great reduction!
✅ Try out @Alibaba_Qwen 3 Coder on vLLM nightly with "qwen3_coder" tool call parser! Additionally, vLLM offers expert parallelism so you can run this model in flexible configurations where it fits.
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
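For context, a minimal sketch of what calling the qwen3_coder tool parser can look like from a client, assuming the server was launched with flags along the lines of the comment below; the tool definition and prompt are purely illustrative:

```python
# Minimal sketch. Assumes a vLLM nightly server started roughly like:
#   vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
#     --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-expert-parallel
# The tool definition below is a hypothetical example, not part of the model release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, for illustration only
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    messages=[{"role": "user", "content": "Run the tests under ./tests and summarize any failures."}],
    tools=tools,
)
# With the qwen3_coder parser enabled server-side, tool calls come back as structured objects.
print(resp.choices[0].message.tool_calls)
```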
The @huggingface Transformers ↔️ @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box! If the model is integrated into Transformers, you can now run it directly with vLLM. github.com/vllm-project/v… Great work @RTurganbay 👏
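A minimal sketch of what "run it directly with vLLM" can look like, assuming a recent build where model_impl="transformers" selects the Transformers backend; the model name and image URL are illustrative:

```python
# Minimal sketch, assuming a recent vLLM build. model_impl="transformers" asks vLLM to
# use the Transformers modeling code for a model without a native vLLM implementation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative: any VLM integrated into Transformers
    model_impl="transformers",            # route through the Transformers backend
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # illustrative URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```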

If you're building with @vllm_project, speak at the dedicated vLLM track at Ray Summit in November.
Last year, the creators of @vllm_project at UC Berkeley hosted a massive two-day vLLM event featuring presentations from Roblox, Uber, Apple, Intel, Alibaba, Neural Magic, IBM, Handshake, Databricks, Anyscale, and others on how they are using and optimizing vLLM. This covered…
Thanks for the great write-up! 🙌 Prefix caching is critical for agentic workflows like @ManusAI_HQ, and vLLM makes it seamless. ✅ Prefix caching is enabled by default with an efficient implementation ✅ Append-only context? Cache hit heaven. Context engineering FTW 🚀
After four overhauls and millions of real-world sessions, here are the lessons we learned about context engineering for AI agents: manus.im/blog/Context-E…
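A minimal sketch of the append-only pattern described above, assuming a recent vLLM where prefix caching is on by default (the flag below just makes it explicit); the model name and prompts are illustrative:

```python
# Minimal sketch of append-only context reuse with prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)  # illustrative model
params = SamplingParams(max_tokens=128)

preamble = "<long, stable agent preamble: system prompt + tool definitions>"

# Turn 1: the shared prefix is prefilled once and its KV blocks are cached.
turn1 = preamble + "\nUser: list the files in the repo"
out1 = llm.generate(turn1, params)

# Turn 2: appending to the same context reuses the cached KV blocks, so only the
# new suffix needs prefill -- the "cache hit heaven" of append-only contexts.
turn2 = turn1 + out1[0].outputs[0].text + "\nUser: now open README.md"
out2 = llm.generate(turn2, params)
```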
🎉Congratulations to @Microsoft for the new Phi-4-mini-flash-reasoning model trained on NVIDIA H100 and A100 GPUs. This latest addition to the Phi family provides developers with a new model optimized for high-throughput and low-latency reasoning in resource-constrained…
We just released native support for @sgl_project and @vllm_project in Inference Endpoints 🔥 Inference Endpoints is becoming the central place where you deploy high-performance inference engines, providing the managed infra so you can focus on your users.
Pro-tip for vLLM power-users: free ≈90% of your GPU VRAM in seconds, no restarts required🚀 🚩 Why you’ll want this • Hot-swap new checkpoints on the same card • Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests) • Stage-based pipelines that call…
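A minimal sketch of the idea, assuming vLLM's sleep-mode API (enable_sleep_mode, llm.sleep, llm.wake_up) is available in your build; the model name is illustrative:

```python
# Minimal sketch: release most GPU VRAM without restarting the process.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)  # illustrative model
print(llm.generate("warm-up prompt")[0].outputs[0].text)

# Level 1 offloads the weights to CPU memory and drops the KV cache, freeing the
# bulk of GPU VRAM while keeping the engine alive for a fast wake-up.
llm.sleep(level=1)

# ...hot-swap a checkpoint, run another model, or do batch work on the same GPU here...

llm.wake_up()  # weights are restored and the engine can serve again
print(llm.generate("back from sleep")[0].outputs[0].text)
```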
vLLM runs on free-threaded Python! A group of engineers from @Meta’s Python runtime language team has shown that it’s possible to run vLLM on the nogil distribution of Python. We’re incredibly excited to embrace this future technique and be early adopters 😍

We hear your voice! For MiniMax in particular, github.com/vllm-project/v… forces the lm_head to be fp32, which restores accuracy but takes a lot of memory. We are experimenting to see if dynamically casting fp16/bf16 to fp32 in the kernel helps the accuracy of the logits.
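Not vLLM's actual kernel, just a plain PyTorch sketch of the idea being tested: keep lm_head in bf16 to save memory and upcast only inside the final projection so the logits come out in fp32; tensor shapes are illustrative:

```python
# Sketch of on-the-fly fp32 logits from bf16 weights (shapes are illustrative).
import torch

hidden = torch.randn(4, 4096, dtype=torch.bfloat16)                 # last hidden states
lm_head_weight = torch.randn(151_936, 4096, dtype=torch.bfloat16)   # vocab x hidden

# Casting inside the projection avoids keeping a permanent fp32 copy of lm_head,
# while the matmul and log-softmax still run in full precision.
logits_fp32 = hidden.float() @ lm_head_weight.float().t()
logprobs = torch.log_softmax(logits_fp32, dim=-1)
```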
Horrifying bug of the day: finding out that vLLM and Hugging Face produce significantly different logprobs discuss.vllm.ai/t/numerical-di…
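For anyone who wants to reproduce such a comparison, a rough sketch assuming a small model both frameworks support; the model name is illustrative, and some difference is expected from dtype and kernel choices alone:

```python
# Compare prompt-token logprobs between Hugging Face Transformers and vLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative
prompt = "The capital of France is"

# Hugging Face reference logprobs.
tok = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    hf_logprobs = torch.log_softmax(hf_model(ids).logits.float(), dim=-1)

# vLLM prompt logprobs for the same tokens.
llm = LLM(model=model_id, dtype="bfloat16")
out = llm.generate(prompt, SamplingParams(max_tokens=1, prompt_logprobs=0))
vllm_prompt_logprobs = out[0].prompt_logprobs  # aligned with prompt positions; index 0 is None

# Token at position p is predicted from HF logits at position p-1.
for pos in range(1, ids.shape[1]):
    token_id = ids[0, pos].item()
    hf_lp = hf_logprobs[0, pos - 1, token_id].item()
    vllm_lp = vllm_prompt_logprobs[pos][token_id].logprob
    print(pos, token_id, hf_lp, vllm_lp, abs(hf_lp - vllm_lp))
```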
🚨 MiniMax-M1 Technical Seminar Join us for our first official seminar — a deep dive into the world’s first open-weight hybrid-attention reasoning model, with 1M-token input & 80K-token output. 🧠 Experts from MiniMax, Anthropic, Hugging Face, vLLM, MIT, HKUST & more 📅 July 10…
💎What makes @vllm_project the Rolls Royce of inference? 👇🏻 We break it down in 5 performance-packed layers😎 ✅ PagedAttention, Prefix Caching, Chunked Prefill ✅ Continuous Batching, Speculative Decoding ✅ FlashAttention, FlashInfer ✅ Tensor/Data & Pipeline Parallelism
🚀#NewBlog @vllm_project 🔥 vLLM for Beginners Part 2: 📖Key Features & Optimizations💫 💎 What makes #vLLM the Rolls Royce of inference? 👉check it out: cloudthrill.ca/what-is-vllm-f… @vllm_project @lmcache #LLMPerformance
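A rough sketch of how several of these layers surface as plain engine options, assuming a recent vLLM release; exact argument names can shift between versions and the model name is illustrative:

```python
# Sketch: turning the "performance layers" into engine options.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
    enable_prefix_caching=True,    # prefix caching (on by default in recent versions)
    enable_chunked_prefill=True,   # chunked prefill interleaves prefill with decode
    tensor_parallel_size=4,        # tensor parallelism across 4 GPUs
    pipeline_parallel_size=1,      # pipeline parallelism across GPUs/nodes if needed
)
# Continuous batching and the attention backend (FlashAttention / FlashInfer) are handled
# by the engine itself; speculative decoding is configured separately.
```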
MiniMax M1 is one of the SOTA open-weight models from @MiniMax__AI. Check out how it is efficiently implemented in vLLM, directly from the team! blog.vllm.ai/2025/06/30/min…
🔥 Another strong open model with Apache 2.0 license, this one from @MiniMax_AI - places in the top 15. MiniMax-M1 is now live on the Text Arena leaderboard, landing at #12. This puts it on equal footing with DeepSeek V3/R1 and Qwen 3! See thread to learn more about its…
"The second most intelligent open weights model after DeepSeek R1, with a much longer 1M token context window!" Checkout the blog post from @MiniMax__AI on how the model is implemented on vLLM, and how you can run this model efficiently! blog.vllm.ai/2025/06/30/min…
MiniMax launches their first reasoning model: MiniMax M1, the second most intelligent open weights model after DeepSeek R1, with a much longer 1M token context window @MiniMax__AI M1 is based on their Text-01 model (released 14 Jan 2025) - an MoE with 456B total and 45.9B active…
PyTorch and vLLM are both critical to the AI ecosystem and are increasingly being used together for cutting edge generative AI applications, including inference, post-training, and agentic systems at scale. 🔗 Learn more about PyTorch → vLLM integrations and what’s to come:…