YIFENG LIU
@YIFENGLIU_AI
CS Ph.D. student working on LLMs @ UCLA AGI Lab. Previous work: RPG, MARS, TPA, Kimi-1.5....
Why does CANADA try to prevent AI researchers from attending conferences in Canada? I doubt whether Canada wants to develop its AI industry. Why does CANADA try to prevent people named after the maple from entering Canada? I doubt whether Canadians love maples.
Which optimizer (from 100+ optimizers for DL models) is best for training Large Language Models? 🤔 github.com/lauyikfung/A-S…

The only thing that’s certain is that MLA has been abandoned for good reason. People should be using TPA instead.
Qwen-3-MoE vs. DeepSeek V2 (original): their designs are superficially similar, but different. This will be a very interesting test of a few scaling laws.
🚀 Introducing MARS v2: Make Variance Reduction Shine! Tired of AdamW being the default optimizer for training large models? What if variance reduction could finally outperform it? MARS is here to change the game! Now with its 2nd version, we’ve refined its core idea and…
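Not the official MARS implementation, just a minimal sketch of the variance-reduction idea: an AdamW-style update applied to a corrected gradient that mixes in a scaled difference between the current and previous gradients. The class name, hyperparameter names, and default values below are illustrative assumptions.

```python
# Sketch of a MARS-style variance-reduction optimizer (illustrative, not official code).
import torch


class MARSSketch(torch.optim.Optimizer):
    def __init__(self, params, lr=3e-3, betas=(0.95, 0.99), eps=1e-8,
                 weight_decay=0.0, gamma=0.025, max_corr_norm=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        gamma=gamma, max_corr_norm=max_corr_norm)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["prev_grad"] = torch.zeros_like(g)
                    state["exp_avg"] = torch.zeros_like(g)
                    state["exp_avg_sq"] = torch.zeros_like(g)
                state["step"] += 1
                t = state["step"]

                # Variance-reduction correction: add a scaled gradient difference.
                corr = g + group["gamma"] * (beta1 / (1 - beta1)) * (g - state["prev_grad"])
                # Clip the corrected gradient so the correction term cannot blow up.
                norm = corr.norm()
                if norm > group["max_corr_norm"]:
                    corr = corr * (group["max_corr_norm"] / norm)
                state["prev_grad"].copy_(g)

                # Standard AdamW-style first/second moments on the corrected gradient.
                state["exp_avg"].mul_(beta1).add_(corr, alpha=1 - beta1)
                state["exp_avg_sq"].mul_(beta2).addcmul_(corr, corr, value=1 - beta2)
                m_hat = state["exp_avg"] / (1 - beta1 ** t)
                v_hat = state["exp_avg_sq"] / (1 - beta2 ** t)

                # Decoupled weight decay, then the preconditioned parameter update.
                p.mul_(1 - group["lr"] * group["weight_decay"])
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])
```

Swapping this in for AdamW in a standard training loop is a drop-in change; the gamma and clipping knobs control how aggressively the variance-reduction correction is applied.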
Scaling laws are good, but anti-scaling-law research is the future. arxiv.org/abs/2501.12948
🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model
- SOTA short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%)
- Long-CoT performance matches o1 across multiple modalities (👀MathVista,…
12/ Joint work with @yifan_zhang_, @YIFENGLIU_AI, @HuizhuoY, Zhen Qin, Yang Yuan, @QuanquanGu, and Andrew Chi-Chih Yao. Incredible work by an outstanding team!
11/ Closing Remarks: “Tensor Product Attention Is All You Need” reframes attention as a dynamic, low-rank factorization. If you need to push context lengths or want more efficient large language models, TPA is your solution. Check out our code at: github.com/tensorgi/T6.
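A minimal sketch of the tensor-product factorization idea, not the T6 implementation: per token, each attention head's query/key/value is reconstructed as a rank-normalized sum of outer products between a small head factor and a small feature factor, so only the factors need to be cached. Shapes, rank choices, and layer names below are illustrative assumptions; see github.com/tensorgi/T6 for the actual code.

```python
# Sketch of tensor-product (low-rank outer-product) attention factorization (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TPASketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, head_dim=64, rank_q=6, rank_kv=2):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.rank_q, self.rank_kv = rank_q, rank_kv
        # Each projection yields, per token, a head factor (rank x n_heads)
        # and a feature factor (rank x head_dim).
        self.q_proj = nn.Linear(d_model, rank_q * (n_heads + head_dim))
        self.k_proj = nn.Linear(d_model, rank_kv * (n_heads + head_dim))
        self.v_proj = nn.Linear(d_model, rank_kv * (n_heads + head_dim))
        self.out_proj = nn.Linear(n_heads * head_dim, d_model)

    def _factorize(self, x, proj, rank):
        # Split into head factors A and feature factors B, then contract the
        # rank dimension into per-head vectors: (batch, seq, heads, head_dim).
        b, s, _ = x.shape
        a, bfac = proj(x).split([rank * self.n_heads, rank * self.head_dim], dim=-1)
        a = a.view(b, s, rank, self.n_heads)
        bfac = bfac.view(b, s, rank, self.head_dim)
        return torch.einsum("bsrh,bsrd->bshd", a, bfac) / rank

    def forward(self, x):
        b, s, _ = x.shape
        q = self._factorize(x, self.q_proj, self.rank_q)
        k = self._factorize(x, self.k_proj, self.rank_kv)  # only K/V factors need caching
        v = self._factorize(x, self.v_proj, self.rank_kv)
        # Standard causal attention over the reconstructed per-head tensors.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, s, -1))
```

The point of the sketch: the KV cache only needs the K/V factors (rank_kv * (n_heads + head_dim) numbers per token) rather than full per-head keys and values, which is where the memory savings at long context come from.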