Eldar Kurtić
@_EldarKurtic
Researcher on efficient inference @RedHat_AI & @ISTAustria
You can now find GuideLLM at its new address: github.com/vllm-project/g…
BIG NEWS! 🎉 GuideLLM is officially joining the @vLLM_project! This combines vLLM's high-speed inference with a powerful, dedicated toolkit for real-world performance validation. Moving from PoC to production just got a lot more scientific. Here's how (1/7):
.@vllm_project office hours return next week! Alongside project updates from @mgoin_, vLLM committers and HPC experts @robertshaw21 + @tms_jr will share how to scale MoE models with llm-d and lessons from real-world multi-node deployments. Register: red.ht/office-hours
FP4 models and inference kernels ready for Blackwell GPUs! GPTQ and Hadamard for accuracy, and fused Hadamard for runtime. Check out more details about our work in the thread below 👇
Announcing our early work on FP4 inference for LLMs! - QuTLASS: low-precision kernel support for Blackwell GPUs - FP-Quant: a flexible quantization harness for Llama/Qwen We reach 4x speedup vs BF16, with good accuracy through MXFP4 microscaling + fused Hadamard rotations.
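The MXFP4 microscaling mentioned above groups values into small blocks that share a single power-of-two scale, with each element stored in 4-bit E2M1 format. A minimal sketch of the idea (not the QuTLASS/FP-Quant code, which runs as fused GPU kernels; block size, rounding, and scale selection here are illustrative assumptions):

```python
# Illustrative sketch of MXFP4-style microscaling quantization.
# Idea: a block of values shares one power-of-two scale, and each
# element is rounded to a 4-bit E2M1 value (representable
# magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6). Not the actual
# QuTLASS/FP-Quant implementation.
import math

E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes


def quantize_mxfp4_block(block):
    """Quantize one block of floats to (shared scale, FP4 values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Power-of-two shared scale so amax maps near the FP4 max (6).
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        level = min(E2M1_LEVELS, key=lambda lvl: abs(lvl - mag))
        quantized.append(math.copysign(level, x))
    return scale, quantized


def dequantize(scale, quantized):
    """Reconstruct approximate floats from the block representation."""
    return [scale * q for q in quantized]
```

The Hadamard rotations in the announcement are applied before quantization to spread outliers across the block, which keeps `amax` small and the per-element error low; that step is omitted here for brevity.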
The @huggingface folks deserve far more credit for being a pillar of open-source and still managing to push out SOTA results across the board, along with a full write-up of the entire model’s lifecycle.
We just released the best 3B model, 100% open-source, open dataset, architecture details, exact data mixtures and full training recipe including pre-training, mid-training, post-training, and synthetic data generation for everyone to train their own. Let's go open-source AI!
Want to learn more about GuideLLM, the tool used by @charles_irl and @modal_labs' LLM Engine Advisor to easily benchmark LLM inference stacks? Join the next vLLM office hours with @ZelenovicSasa, @mgoin_, Jenny Yi, and @markurtz_. More details in the thread below 👇
Heard about GuideLLM? It's Red Hat’s open-source toolkit for benchmarking real-world LLM inference. Simulate traffic, measure throughput & latency, and ensure your deployment meets its service-level objectives (SLOs). Here's how it works and how to get started - a thread:
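The core of what a GuideLLM-style benchmark reports can be sketched in a few lines: fire requests at a target, then compute throughput and latency percentiles to check against SLOs. This is not GuideLLM's API; `send_request` is a stand-in stub (the real tool drives an OpenAI-compatible server and simulates configurable traffic patterns):

```python
# Minimal sketch of what an LLM benchmark like GuideLLM measures:
# requests per second plus p50/p95 latency over a batch of prompts.
# `send_request` is a placeholder so the sketch runs anywhere.
import statistics
import time


def send_request(prompt):
    """Stand-in for a call to the inference server."""
    time.sleep(0.005)  # pretend the model took ~5 ms
    return prompt[::-1]


def run_benchmark(prompts):
    """Send all prompts sequentially and summarize the results."""
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        send_request(p)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "requests_per_sec": len(prompts) / elapsed,
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[max(0, int(0.95 * len(latencies)) - 1)],
    }
```

In practice you would compare `p95_latency_s` against the SLO target, and sweep request rates rather than sending sequentially, which is exactly the kind of workload simulation the real tool automates.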
The recording of @egallen's and my @PyTorch Day France 2025 and @gosimfoundation talk, "Scaling LLM Inference with vLLM," is now available on PyTorch’s YouTube channel. youtube.com/watch?v=XYh6Xf…
Red Hat team absolutely smashing it!! Integration with axolotl is huge for training
🚨 Introducing the Axolotl-LLM Compressor integration, designed to make fine-tuning sparse models easier and more efficient than ever! Now you can fine-tune sparse models for specific data while preserving their sparse structure and recovering any accuracy lost during pruning.…
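The key trick behind "preserving their sparse structure" during fine-tuning is simple to state: record the pruned (zero) positions once, then mask the gradients every step so updates never revive pruned weights. A toy sketch of that idea with plain-Python lists standing in for tensors (the Axolotl + LLM Compressor integration does this at the framework level; this is not their code):

```python
# Sketch of sparsity-preserving fine-tuning: zero weights stay zero
# because their gradients are masked out before each update.


def make_mask(weights):
    """1.0 where a weight survives pruning, 0.0 where it was pruned."""
    return [0.0 if w == 0.0 else 1.0 for w in weights]


def sgd_step(weights, grads, mask, lr=0.1):
    """One masked SGD update; pruned positions receive no update."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]
```

Because the mask is fixed after pruning, the model can recover accuracy on new data while the sparsity pattern (and thus the inference speedup) is preserved.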
LLM-Compressor now integrated with Axolotl!
Join me to hear about decentralised training, why it works and what opportunities it can unlock 🚀. Many thanks to @Sree_Harsha_N for the invitation!
@itsmaddox_j will be presenting at the ML Efficiency group - this is going to be a very fun session :D Andrej is super passionate and I hope this will be useful for anyone who'd like to make sense of why decentralized training is exciting. Join us cohere.com/events/Cohere-…
Want to quickly get a feeling for how fast an LLM runs under different workloads (and in different engines)? Look no further, @charles_irl and @modal_labs built a really cool app for it. Pro tip: don't skip the "Executive Summary" and "How to Benchmark", well worth the read!
Dozens of teams have asked my advice on running LLMs. How fast is @deepseek_ai V3 with vLLM on 8 GPUs? What's the max throughput of @Alibaba_Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow. So we built a tiny app, the LLM Engine Advisor.
Today at 15:00 CEST, I’ll give a talk at OpenSource@Siemens on efficient inference with LLMs. 📺 The talk will be live-streamed at opensource.siemens.com, followed by a live Q&A. Feel free to tune in and bring your questions! It’s a tutorial-style session covering the basics…
Love this approach by @RedHat_AI. We need more trust & validation in AI and this can help! huggingface.co/RedHatAI
Major props to the contributors who made this release happen 🙌 @JoelNiklaus @_lewtun @ailozovskaya @clefourrier @alvind319 HERIUN @_EldarKurtic @mariagrandury jnanliu @qubvelx Check out the release & try it out: 🔗 github.com/huggingface/li…