Red Hat AI
@RedHat_AI
Deliver AI value with the resources you have, the insights you own, and the freedom you need.
LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works—and how you can use it today:
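For a rough picture of what "disaggregated compute" means here, below is a toy Python sketch of splitting prefill and decode across separate worker pools. This is illustrative only, not llm-d's actual API; every class and name is hypothetical.

```python
# Conceptual sketch of prefill/decode disaggregation (NOT llm-d's API; names are hypothetical).
# Idea: route the compute-bound prefill and the memory-bound decode to separate worker pools,
# handing the KV cache off between them, so each pool can be sized and scheduled independently.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillPool:
    def run(self, req: Request) -> dict:
        # Compute-bound stage: process the whole prompt once and build its KV cache.
        return {"kv_cache": f"<kv for {len(req.prompt)} chars>", "first_token": "llm-d"}

class DecodePool:
    def run(self, req: Request, state: dict) -> str:
        # Memory-bound stage: generate tokens one at a time, reusing the handed-off KV cache.
        return state["first_token"] + " ..." * (req.max_new_tokens - 1)

def serve(req: Request) -> str:
    prefill, decode = PrefillPool(), DecodePool()
    state = prefill.run(req)       # stage 1 on the prefill worker pool
    return decode.run(req, state)  # stage 2 on the decode pool, after the KV handoff

print(serve(Request(prompt="Explain llm-d in one line.", max_new_tokens=4)))
```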
Happening next week! Hear the @vllm_project update and learn how to scale MoE with @_llm_d_. Register: red.ht/office-hours

Serving LLMs at scale is tough. Slow response times, poor GPU utilization, and high costs get in the way. In this video, @mgoin_ explains how @vllm_project tackles these challenges with up to 24x higher throughput and efficient batching. See how it works and more: youtube.com/watch?v=lxjWiV…
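For context, a minimal vLLM offline-inference example (the model id is chosen for illustration); passing several prompts at once lets vLLM batch them for you.

```python
from vllm import LLM, SamplingParams

# vLLM schedules these prompts together (continuous batching),
# keeping the GPU busy instead of serving requests one by one.
prompts = [
    "What is PagedAttention?",
    "Summarize continuous batching in one sentence.",
]
sampling = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model for illustration; swap in your own
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```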
Thrilled to see @Meta joining @thealliance_ai! We're excited to continue our work with Meta and all AI Alliance members as we collectively drive an open future for AI. 🤝
🚨 New open-source drop: The AI Alliance is now supporting Llama Stack, a modular AI application framework developed by Meta. Built for portability, developer choice, and real-world deployment. Details ⬇️ 🔗 thealliance.ai/blog/ai-allian…
llm-d organizes through 7 specialized teams (SIGs): 🔀 Inference Scheduler 📊 Benchmarking ⚡ PD-Disaggregation 🗄️ KV-Disaggregation 🚀 Installation 📈 Autoscaling 👀 Observability Weekly meetings, public docs, active Slack channels. Join today! llm-d.ai/docs/community…
.@vllm_project office hours return next week! Alongside project updates from @mgoin_, vLLM committers and HPC experts @robertshaw21 + @tms_jr will share how to scale MoE models with llm-d and lessons from real-world multi-node deployments. Register: red.ht/office-hours

vLLM is finally addressing a long-standing problem: startup time. Cutting CUDA graph capture from 35s to 2s is a great reduction!
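One way to see the capture share of startup on your own hardware: vLLM's `enforce_eager=True` skips CUDA graph capture entirely (at some decode-throughput cost). A small timing sketch, with an illustrative model id:

```python
import sys
import time
from vllm import LLM

# Run this script twice, once with the "eager" argument and once without, to compare
# engine startup with and without CUDA graph capture. enforce_eager=True disables
# capture entirely, which costs some decode throughput but removes the capture time.
eager = len(sys.argv) > 1 and sys.argv[1] == "eager"

start = time.perf_counter()
LLM(model="facebook/opt-125m", enforce_eager=eager)  # small model for illustration
print(f"enforce_eager={eager}: startup took {time.perf_counter() - start:.1f}s")
```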
🎉 Huge congrats to @luka_govedic from @RedHat who’s now a core committer to @vllm_project! He’s led the torch.compile integration and custom passes, the vLLM startup time reduction initiative, AMD enablement, and more. A well-earned milestone 👏 github.com/ProExpertProg
Random Samples: Grounding Feedback is All You Need: Aligning Small Vision-Language Models x.com/i/broadcasts/1…
Llama 4 quantization support just landed in llm-compressor! ✅ W4A16 quantization ✅ FP4 quantization ✅ Support for Llama 4 tokenizer + model loading This sets the stage for fast, community-optimized Llama 4 models. Jump in to try, test, contribute: github.com/vllm-project/l…
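For a sense of what a W4A16 run looks like, here is a rough sketch following llm-compressor's oneshot flow. Exact import paths and arguments can differ between versions, and the model and dataset ids are illustrative; check the repo's examples before running.

```python
# Rough sketch of a W4A16 GPTQ run with llm-compressor (verify the API for your version;
# the model id and calibration dataset below are illustrative choices).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize the Linear layers...
    scheme="W4A16",        # ...to 4-bit weights with 16-bit activations
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model id
    dataset="open_platypus",                            # illustrative calibration set
    recipe=recipe,
    output_dir="Llama-4-Scout-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```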
Red Hat AI Inference Server allows you to run any LLM, on any accelerator, in any cloud environment. And it's all open source! Hear from Red Hat CEO @matthicksj on what this means for you and your business.
Red Hat AI Inference Server delivers our vision of running any gen AI model on any AI accelerator in any cloud environment. See how we're empowering our customers with @RedHat_AI: red.ht/4lyPlKd
How do you solve AI's biggest performance hurdles? On Technically Speaking, @kernelcdub & Nick Hill dive into vLLM, exploring how techniques like PagedAttention solve memory bottlenecks & accelerate inference: red.ht/4lDjJ5P.
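To make the PagedAttention idea concrete, a toy sketch of block-based KV cache management. This is a conceptual illustration only, not vLLM's internals:

```python
# Toy illustration of the PagedAttention idea (not vLLM's internals): the KV cache lives in
# fixed-size blocks, and each sequence keeps a "block table" mapping logical token positions
# to physical blocks, so memory is allocated on demand with no big contiguous reservation.
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:             # current block is full (or first token)
            table.append(self.free_blocks.pop())   # grab a physical block on demand
        return table[position // BLOCK_SIZE]       # physical block holding this token

    def free(self, seq_id: int) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(20):             # a 20-token sequence needs only 2 of the 8 blocks
    cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])      # [7, 6]: two non-contiguous physical blocks
cache.free(seq_id=0)              # blocks return to the pool for other sequences
```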
Random Samples, our weekly seminar series that bridges the gap between cutting-edge AI research and real-world application, continues this Friday, July 18! Title: Grounding Feedback is All You Need: Aligning Small Vision-Language Models Abstract: While recent vision-language…

MiniMax M1 is one of the SOTA open-weight models from @MiniMax__AI. Check out how it is efficiently implemented in vLLM, directly from the team! blog.vllm.ai/2025/06/30/min…
🔥 Another strong open model with Apache 2.0 license, this one from @MiniMax_AI - places in the top 15. MiniMax-M1 is now live on the Text Arena leaderboard landing at #12. This puts it at equal ranking with Deepseek V3/R1 and Qwen 3! See thread to learn more about its…
If you're curious where @RedHat fits into this whole AI thing, watch this quick interview with @matthicksj on @theCUBE: (spoiler: the answer is @RedHat_AI) youtu.be/dIe3-sfZfKc?si…
FP4 models and inference kernels ready for Blackwell GPUs! GPTQ and Hadamard for accuracy, and fused Hadamard for runtime. Check out more details about our work in the thread below 👇
Announcing our early work on FP4 inference for LLMs! - QuTLASS: low-precision kernel support for Blackwell GPUs - FP-Quant: a flexible quantization harness for Llama/Qwen We reach 4x speedup vs BF16, with good accuracy through MXFP4 microscaling + fused Hadamard rotations.
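For intuition, a minimal numerical sketch of the microscaling idea behind MXFP4: small groups of weights share one power-of-two scale, and each element is stored in a tiny low-bit format. This is not the QuTLASS/FP-Quant code; it fakes the 4-bit format with integer levels and skips the Hadamard rotation, just to show the group-scale mechanics.

```python
# Minimal sketch of MX-style microscaling quantization (illustrative, not QuTLASS/FP-Quant).
import numpy as np

GROUP = 32   # elements sharing one scale (the MX block size)
LEVELS = 7   # crude stand-in for FP4's limited per-element range

def quantize_mx(w: np.ndarray):
    w = w.reshape(-1, GROUP)
    # One shared power-of-two scale per group, sized so the group's max fits in range.
    scale = 2.0 ** np.ceil(np.log2(np.abs(w).max(axis=1, keepdims=True) / LEVELS + 1e-12))
    q = np.clip(np.round(w / scale), -LEVELS, LEVELS)  # low-bit value per element
    return q, scale

def dequantize_mx(q, scale):
    return (q * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_mx(w)
err = np.abs(dequantize_mx(q, s) - w).mean()
print(f"mean abs error: {err:.4f}")  # stays small because each group gets its own scale
```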
Random Samples: On scalable RL in the era of agentic LLMs x.com/i/broadcasts/1…
Nick Hill digs into the details of vLLM with me on Technically Speaking. It's helpful for understanding why vLLM is so important for high-performance, open source AI inferencing.
How do you solve AI's biggest performance hurdles? On Technically Speaking, @kernelcdub & Nick Hill dive into vLLM, exploring how techniques like PagedAttention solve memory bottlenecks & accelerate inference: red.ht/4lDjJ5P.
Want to influence the future of llm-d? Our 5-min survey on real-world LLM use cases is open until July 11. We're reviewing the results live at our community meeting on July 16th, so your voice will be heard immediately. Make an impact: red.ht/llm-d-user-sur… #AI #MLOps #vllm
Red Hat and @AMD are bringing together the power of @RedHat_AI with AMD’s portfolio of high-performance computing architectures to support optimized, cost-efficient, and production-ready environments for AI-enabled workloads. Check it out. #RHSummit sprou.tt/10b87MeYUwM
Red Hat + @NVIDIA = a new wave of agentic AI innovation 💡 See how we're supporting NVIDIA Blackwell AI factories across @RedHat_AI and the hybrid cloud. #RHSummit sprou.tt/1ypiGRkRnJ5