Infini-AI-Lab
@InfiniAILab
Huge thanks to @tinytitans_icml for an amazing workshop — see you next year! Honored to receive a Best Paper Award 🏆 Let’s unlock the potential of sparsity! Next up: scaling to hundreds/thousands of rollouts? Or making powerful R1/K2-level LLMs (not just 8B 4-bit models) run…

Introducing Weaver, a test-time scaling method for verification! Weaver shrinks the generation-verification gap through a low-overhead weak-to-strong optimization of a mixture of verifiers (e.g., LM judges and reward models). The Weavered mixture can be distilled into a tiny…
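As a rough illustration of the mixture-of-verifiers idea (not Weaver's actual weak-to-strong optimization), here is a minimal sketch: combine per-verifier scores with learned weights and select the highest-scoring candidate. The function name, weights, and scores below are all hypothetical.

```python
import numpy as np

def weighted_verifier_score(scores, weights):
    """scores: (n_candidates, n_verifiers) matrix of per-verifier scores.
    weights: (n_verifiers,) nonnegative weights for each weak verifier."""
    return scores @ weights

# Four sampled answers scored by three weak verifiers (made-up numbers).
scores = np.array([
    [0.9, 0.4, 0.7],
    [0.2, 0.8, 0.5],
    [0.6, 0.6, 0.9],
    [0.1, 0.3, 0.2],
])
# In a weak-to-strong setup the weights would be fit on a small labeled set;
# here they are simply assumed.
weights = np.array([0.5, 0.2, 0.3])
best = int(np.argmax(weighted_verifier_score(scores, weights)))
print(f"selected candidate: {best}")
```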
#MLSys2026 will be led by the general chair @luisceze and PC chairs @JiaZhihao and @achowdhery. The conference will be held in Bellevue on Seattle's east side. Consider submitting and bringing your latest works in AI and systems—more details at mlsys.org.
📢 Exciting updates from #MLSys2025! All session recordings are now available and free to watch at mlsys.org. We’re also thrilled to announce that #MLSys2026 will be held in Seattle next May—submissions open next month with a deadline of Oct 30. We look forward to…
This is cool!!!
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's…
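To give a flavor of the kind of custom sparsity pattern such an abstraction lets you focus on, here is a minimal sketch of a causal sliding-window attention mask; `local_window_mask` is a hypothetical helper for illustration, not sparse-frontier's actual API.

```python
import torch

def local_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask where query i attends only to
    keys i-window+1 .. i (a causal sliding window)."""
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]  # query index minus key index
    return (dist >= 0) & (dist < window)

mask = local_window_mask(seq_len=8, window=3)
print(mask.int())
```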
Great to see a lot of interest! It takes some time to construct the superpositional encoding correctly, and to make it compatible with popular positional embeddings. So it is not super obvious😁. More interestingly, our experiments show that such superpositional encodings…
It is intuitively obvious that reasoning in continuous embedding space is dramatically more powerful than reasoning in discrete token space. This paper from @tydsh and team shows that this is the case theoretically.
🐳 DeepSeek-R1 just got more accessible

Introducing our new cost-optimized endpoint for DeepSeek-R1 0528:
✨ High-quality reasoning
✨ $0.55/$2.19 per million tokens
✨ No quality compromises

Perfect for developers needing powerful reasoning at accessible pricing 💰
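For a quick sense of what those rates mean in practice, here is a back-of-the-envelope cost calculation; the helper `request_cost` and the token counts are illustrative assumptions, not part of the endpoint's API.

```python
IN_RATE, OUT_RATE = 0.55, 2.19  # USD per million tokens, from the quoted rates

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the quoted input/output rates."""
    return input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE

# A hypothetical long reasoning call: 2k prompt tokens, 8k generated tokens.
print(f"${request_cost(2_000, 8_000):.4f}")  # ≈ $0.0186
```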
wow 🤩 check this out!!!
One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀 Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized…
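As a toy illustration of why fusion pays off (not MPK itself, which compiles real GPU megakernels fusing compute and communication), the sketch below contrasts a chain of tiny elementwise ops, each paying fixed per-op dispatch overhead, with the same repeated affine map collapsed into a single pass.

```python
import time
import torch

x = torch.randn(1 << 20)

def unfused(x, n=200):
    # 2*n separate elementwise ops: each one pays fixed dispatch/launch cost
    for _ in range(n):
        x = x * 1.0001
        x = x + 0.0001
    return x

def fused(x, n=200):
    # the same repeated map x -> 1.0001*x + 0.0001, collapsed algebraically
    # into a single pass over the data
    a = 1.0001 ** n
    b = 0.0001 * (a - 1.0) / (1.0001 - 1.0)  # geometric-series closed form
    return x * a + b

t0 = time.perf_counter(); y1 = unfused(x); t1 = time.perf_counter()
y2 = fused(x); t2 = time.perf_counter()
print(f"unfused: {t1 - t0:.4f}s  fused: {t2 - t1:.4f}s  "
      f"max diff: {(y1 - y2).abs().max().item():.2e}")
```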
Recordings: youtube.com/watch?v=TPz3OF…
Slides: asap-seminar.github.io/assets/slides/…
Tomorrow at 2 PM Eastern Time, the ASAP seminar will feature @Xinyu2ML presenting an exciting work on parallel reasoning. (Xinyu is also a co-organizer of the seminar series—and said he'll be hosting himself, lol.)
@Xinyu2ML will be presenting this amazing work at ASAP seminar tomorrow! Do not miss his talk
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation.
🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%.
🌐 Website: multiverse4fm.github.io
🧵 1/n