Lijie (Derrick) Yang
@LijieyYang
CS Undergrad @CarnegieMellon, incoming CS PhD @Princeton, doing research in ML and Systems
Officially graduated from @SCSatCMU 🎓(Allen Newell Award, Honorable Mention) and thrilled to be starting my PhD at @Princeton with Prof. Ravi Netravali 🚀! Huge thanks to my advisor Mark Stehlik, research mentors @JiaZhihao @tqchenml, and amazing CMU Catalyst collaborators!


Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard. 🚀Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized…
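For intuition, here is a toy Python sketch of the performance argument (purely illustrative; `launch`, `ops`, and the fused function are made-up stand-ins, not MPK's API). The baseline pays a launch overhead per operator, while the fused version pays it once:

```python
import time

def launch(op, x, overhead_us=5):
    """Stand-in for a GPU kernel launch: fixed overhead, then the op itself."""
    time.sleep(overhead_us / 1e6)
    return op(x)

ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # toy "layers"

def run_baseline(x):
    # One kernel launch per operator: overhead scales with the number of ops.
    for op in ops:
        x = launch(op, x)
    return x

def run_fused(x):
    # "Megakernel": all ops execute inside a single launch, overhead paid once.
    def fused(y):
        for op in ops:
            y = op(y)
        return y
    return launch(fused, x)

assert run_baseline(1) == run_fused(1)  # same result, fewer launches
```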
FlashInfer won #MLSys2025 best paper🏆, with backing from @NVIDIAAIDev to bring top LLM inference kernels to the community
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and…
Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will enable Catalyst to run and test their work on a world-class unified AI platform.
Attending #ASPLOS25 / #EuroSys25? Join our half-day tutorial tomorrow on Efficient Systems and Compilers for Generative AI! We will introduce our latest research: ✨ Mirage: auto-gen fast GPU kernels for LLMs directly from math definitions 💸 FlexLLM: memory-efficient LLM…
Excited to share my first work of graduate school! BARE is a novel method for generating diverse, high-quality synthetic datasets, leveraging the diversity of base models and the quality of instruct-tuned models. Check out the thread and feel free to reach out to @pgasawa and me!
Instruct-tuned models are getting better at following instructions and ‘reasoning’ every day, but they’re shockingly poor at generating diverse responses. Diversity is crucial to many tasks like synthetic data generation. We tackle this with a new approach, BARE 🐻! (1/n)
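A minimal sketch of the two-stage idea as described above (the `base_sample` and `instruct_refine` callables and the refinement prompt are hypothetical stand-ins, not BARE's actual interface): a base model drafts diverse candidates, and an instruct-tuned model refines each for quality.

```python
def bare_generate(task_prompt, base_sample, instruct_refine, n=8, temperature=1.0):
    """Draft diverse candidates with a base model, then refine each with an
    instruct-tuned model: diversity from stage one, quality from stage two."""
    examples = []
    for _ in range(n):
        draft = base_sample(task_prompt, temperature=temperature)
        refined = instruct_refine(
            "Improve this synthetic example for correctness and clarity:\n" + draft
        )
        examples.append(refined)
    return examples
```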
🚀Introducing TidalDecode: speed up #LLM decoding with position-persistent sparse attention. 9x faster than full attention with no accuracy loss. 🔑Key insight: consecutive LLM layers utilize similar key tokens for sparse attention. 🌟Our simple yet effective approach: select a…
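A rough PyTorch sketch of the position-persistent idea for a single decoding step (the choice of selection layers and the top-k budget here are illustrative assumptions, not the paper's exact configuration): a few layers run full attention to pick the top-k key positions, and subsequent layers reuse those positions for sparse attention.

```python
import torch

def attend(q, K, V, idx=None):
    # q: (d,); K, V: (n, d). If idx is given, attend only to those key positions.
    if idx is not None:
        K, V = K[idx], V[idx]
    scores = (K @ q) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=0) @ V, scores

def decode_step(queries, kv_cache, selection_layers=(0, 13), budget=256):
    # queries[l]: (d,) per layer; kv_cache[l]: (K, V), each of shape (n, d).
    idx, outputs = None, []
    for layer, q in enumerate(queries):
        K, V = kv_cache[layer]
        if layer in selection_layers or idx is None:
            out, scores = attend(q, K, V)                        # full attention
            idx = scores.topk(min(budget, len(scores))).indices  # persist top-k positions
        else:
            out, _ = attend(q, K, V, idx)                        # reuse selected positions
        outputs.append(out)
    return outputs
```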