EmbeddedLLM
@EmbeddedLLM
Your open-source AI ally. We specialize in integrating LLMs into your business.
Pro-tip for vLLM power-users: free ≈90% of your GPU VRAM in seconds, no restarts required 🚀
🚩 Why you’ll want this:
• Hot-swap new checkpoints on the same card
• Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests)
• Stage-based pipelines that call…
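For readers who want to try this, here is a minimal sketch with vLLM's offline API, assuming the sleep-mode feature (enable_sleep_mode, sleep, wake_up) present in recent vLLM releases; exact flag names and levels may differ on older versions, and the model ID is just a placeholder:

from vllm import LLM

# Load with sleep mode enabled (assumed flag in recent vLLM releases)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)
print(llm.generate(["Hello"])[0].outputs[0].text)

# Level 1 offloads weights to CPU RAM and drops the KV cache, freeing most GPU VRAM
llm.sleep(level=1)

# ...swap in another model or run a different stage of the pipeline here...

# Bring the engine back without restarting the process
llm.wake_up()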

🚀 vLLM v0.10.0 is LIVE! Faster, leaner, and more powerful. The TL;DR:
⚡️ Performance & Hardware Highlights:
Experimental Async Scheduling: +3-15% throughput by overlapping scheduling & GPU execution.
Huge AMD Gains (from our team!): +68.3% throughput on Deepseek-V3/R1 with Full…

So, I know a thing or two about ROCm by now. I figured now is a good time to collect my yappings about AMD, especially in the wake of Advancing AI, into a Substack article. Link in replies ⬇️
Intern-S1 is supported in vLLM now, thanks to the joint efforts of the vLLM team and the InternLM team @intern_lm ♥️
The easy way:
uv pip install vllm --extra-index-url wheels.vllm.ai/nightly
vllm serve internlm/Intern-S1 --tensor-parallel-size 8 --trust-remote-code
🚀 Introducing Intern-S1, our most advanced open-source multimodal reasoning model yet!
🥳 Strong general-task capabilities + SOTA performance on scientific tasks, rivaling leading closed-source commercial models.
🥰 Built upon a 235B MoE language model and a 6B vision encoder.…
This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA for their 321B-A38B MoE model Step3, served on H800! The implementation is based on vLLM, and we are working…
vLLM v0.10.0 just released, and its biggest feature might be a hidden gem: initial support for the OpenAI /responses API. It might sound like a small feature, but this is a huge market signal. The industry is moving in this direction for building the next generation of powerful,…
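For a feel of what that looks like in practice, here is a hedged sketch using the OpenAI Python client pointed at a local vLLM server; the model name and port are placeholders, and how much of the Responses API surface is covered depends on the vLLM version:

from openai import OpenAI

# Standard OpenAI client, redirected to a locally running vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Same call shape as the hosted Responses API
resp = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input="Explain why continuous batching improves LLM serving throughput.",
)
print(resp.output_text)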

The @huggingface Transformers ↔️ @vllm_project integration just leveled up: Vision-Language Models are now supported out of the box! If the model is integrated into Transformers, you can now run it directly with vLLM. github.com/vllm-project/v… Great work @RTurganbay 👏
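A minimal sketch of what that enables, assuming the Transformers fallback backend exposed via model_impl="transformers" in recent vLLM releases; the model ID below is just a placeholder for a Transformers-integrated VLM:

from vllm import LLM, SamplingParams

# Ask vLLM to run the model through its Transformers backend instead of a
# native vLLM implementation (useful for models that only exist in Transformers)
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # placeholder Transformers-integrated VLM
    model_impl="transformers",
    trust_remote_code=True,
)

print(llm.generate(["What does the Transformers backend in vLLM do?"],
                   SamplingParams(max_tokens=64))[0].outputs[0].text)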
Llama 4 quantization support just landed in llm-compressor!
✅ W4A16 quantization
✅ FP4 quantization
✅ Support for Llama 4 tokenizer + model loading
This sets the stage for fast, community-optimized Llama 4 models. Jump in to try, test, contribute: github.com/vllm-project/l…
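For a sense of the workflow, here is a hedged sketch of a W4A16 one-shot run following llm-compressor's README-style recipes; the model ID is a placeholder (a real Llama 4 checkpoint needs far more memory), and some schemes also want a calibration dataset:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# W4A16: 4-bit weights, 16-bit activations; the lm_head is usually left unquantized
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot quantization, saving a checkpoint that vLLM can load directly
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; swap in a Llama 4 checkpoint
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",
)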
The two biggest stories in Python performance just collided. vLLM now runs with no GIL.
vLLM runs on free-threaded Python! A group of engineers from @Meta’s Python runtime team has shown that it’s possible to run vLLM on the free-threaded (nogil) build of Python. We’re incredibly excited to embrace this future direction and be early adopters 😍
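A quick way to check whether you are actually on a free-threaded build before experimenting (standard-library calls available from Python 3.13; running vLLM itself on nogil still needs the experimental setup described above):

import sys, sysconfig

# Non-zero on CPython builds compiled with --disable-gil (free-threaded / "nogil")
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Python 3.13+ also reports whether the GIL is enabled at runtime
print(sys._is_gil_enabled())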
All the credit goes to the AMD ROCm teams, working tirelessly on the feedback. Happy 4th. Run Free and Open.
Happy 4th of July! Speed is the Moat & @AnushElangovan & his team keep running faster & faster. Still lots of areas where ROCm has gaps, but many are already closing.
MiniMax M1 is one of the SOTA open-weight models from @MiniMax__AI. Check out how it is efficiently implemented in vLLM, directly from the team! blog.vllm.ai/2025/06/30/min…
🔥 Another strong open model with an Apache 2.0 license, this one from @MiniMax_AI, places in the top 15. MiniMax-M1 is now live on the Text Arena leaderboard, landing at #12. This puts it on equal footing with Deepseek V3/R1 and Qwen 3! See thread to learn more about its…
PyTorch and vLLM are both critical to the AI ecosystem and are increasingly being used together for cutting-edge generative AI applications, including inference, post-training, and agentic systems at scale. 🔗 Learn more about PyTorch → vLLM integrations and what’s to come:…
Great discussions, @mgoin_! We're thrilled to partner with @RedHat_AI and @AMD to enhance @vllm_project. It's an honor to contribute to such a vibrant and global open-source community. Onwards!
vLLM is truly a global phenomenon. From San Francisco to Boston and New York, and across Tokyo, Singapore, and Beijing, meetups are packed with passionate AI developers pushing the boundaries of inference performance. We ❤️ this community!
Exciting first day talking about @vllm_project in Singapore! I had a great time discussing in depth with @EmbeddedLLM how we will make @AMD better across the diverse features and workloads in vLLM. So thankful for our vibrant OSS community across the world 🫶
Let's goooo
vLLM has just reached 50K GitHub stars! Huge thanks to the community!🚀 Together let's bring easy, fast, and cheap LLM serving to everyone✌🏻
Thank you @AMD @LisaSu @AnushElangovan for Advancing AI together with @vllm_project! We look forward to the continued partnership and pushing the boundary of inference.