Feng Yao
@fengyao1909
Ph.D. student @UCSD_CSE | Intern @Amazon Rufus Foundation Model | Ex. @MSFTResearch @TsinghuaNLP
Struggling with Fine-tuning MoE? Meet DenseMixer, an MoE post-training method that offers a more precise router gradient, making MoE easier to train and better performing! Blog: fengyao.notion.site/moe-posttraini…

[Implicit Personalization of #LLMs] How do we answer the question "What colo(u)r is a football?" Answer 1: "Brown 🏈". Answer 2: "Black and white ⚽". We propose a #Causal framework to test whether LLMs adjust their answers depending on the cultural background inferred from the question.…
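A minimal sketch of what such a probe could look like in code; the `ask` helper below is hypothetical (wire it to whichever chat-LLM client is under test), and this illustrates only the spelling-based intervention, not the paper's actual causal framework.

```python
# Hypothetical probe: does the model's answer shift with the spelling variant?
# `ask` is a placeholder, not a real API; plug in the LLM client under test.
from collections import Counter

def ask(prompt: str) -> str:
    """Placeholder: send `prompt` to the model under test and return its reply."""
    raise NotImplementedError("plug in your LLM client here")

def spelling_probe(n_samples: int = 20) -> dict[str, Counter]:
    prompts = {
        "US spelling": "What color is a football?",   # American football -> "brown"?
        "UK spelling": "What colour is a football?",  # soccer ball -> "black and white"?
    }
    return {label: Counter(ask(p).strip().lower() for _ in range(n_samples))
            for label, p in prompts.items()}

# Comparing the two answer distributions shows whether the cultural background
# signaled only by spelling changes the model's response.
```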
Feel free to join if you're interested! This will be my first public talk on DenseMixer; I was previously invited by the Qwen Team to give a private talk, and I'm now excited to share it more broadly at Cohere.
Be sure to join our Regional Asia group next week on July 23rd as they sit down with @fengyao1909 for an insightful session on "DenseMixer: Improving MoE Post-Training with Precise Router Gradient." Thanks to @KanwalMehreen2 and @AhmadMustafaAn1 for organizing this session!
What happened after Dream 7B? First, Dream-Coder 7B: a fully open diffusion LLM for code delivering strong performance, trained exclusively on public data. Plus, DreamOn cracks the variable-length generation problem! It enables code infilling that goes beyond a fixed canvas.
Attention is the bottleneck in video DiTs: 5 s of 720p = 100K+ tokens, and the quadratic cost blows up fast. Sparse/linear attention is key for long-context world models. Track relevant papers in our Awesome-Video-Attention repo: github.com/hao-ai-lab/Aws… #WorldModel #VideoAI
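Where the 100K+ figure comes from, as back-of-the-envelope arithmetic; the compression factors below (8× spatial VAE downsampling, 2×2 patchify, 24 fps, 4× temporal compression) are common choices assumed for illustration, not numbers quoted in the post.

```python
# Back-of-the-envelope token count for 5 s of 720p video in a DiT.
# All compression factors are assumptions for illustration.
height, width = 720, 1280
vae_down, patch = 8, 2                   # assumed spatial VAE downsampling and patch size
fps, seconds, temporal_down = 24, 5, 4   # assumed frame rate and temporal compression

tokens_per_frame = (height // vae_down // patch) * (width // vae_down // patch)  # 45 * 80 = 3600
latent_frames = fps * seconds // temporal_down                                    # 30
total_tokens = tokens_per_frame * latent_frames                                   # 108,000

print(f"tokens: {total_tokens:,}")                 # ~100K+, matching the post
print(f"attention pairs: {total_tokens**2:.1e}")   # ~1.2e10 -> the quadratic blow-up
```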
My brilliant friend just got an ICML 2025 oral for this wild and elegant theory connecting double descent, grokking, and prime numbers. He's on the job market; if you're hiring, don't miss out. Feel free to reach out!
Our new ICML 2025 oral paper proposes a unified theory of both Double Descent and Grokking, revealing that both of these deep learning phenomena can be understood as being caused by prime numbers in the network parameters. [1/8]
Glad to see DenseMixer has been integrated into Axolotl, a fully open-source framework for post-training various LLMs! Feel free to try it out!
Axolotl v0.11.0 is out! We've included ALST's TiledMLP for longer-sequence-length training, as well as support for Devstral, DenseMixer (MoE performance), and the most recent releases of transformers 4.53.1, accelerate 1.8.1, and FlashAttention2.
New release for the Phi family! **SlimMOE** (arxiv.org/abs/2506.18349) trims bulky Phi-3.5-MoE experts into agile models (4-6× smaller) with MINIMAL accuracy loss. If you ❤️ Phi-3 mini/small, you'll love these lighter siblings.
Training mixture-of-experts models hits a wall at the router. This new technique, DenseMixer, fixes that by trading one extra forward pass on inactive experts for a precise router gradient. DenseMixer sends each training token through every expert once, then treats the router's…
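A rough straight-through sketch of that idea, assuming a plain softmax top-k router with unnormalized gates; it illustrates the dense-forward / precise-router-gradient trick described above and is not the official DenseMixer implementation (see the blog for that).

```python
# Sketch of the idea in the post: push every token through every expert once,
# keep the usual sparse top-k output, and give the router a dense gradient.
# Illustration under simplifying assumptions, not the official DenseMixer code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseRouterGradMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        _, topk_idx = gates.topk(self.top_k, dim=-1)
        hard_mask = torch.zeros_like(gates).scatter(-1, topk_idx, 1.0)
        # Straight-through routing weights: identical to the masked top-k gates in
        # the forward pass, but the backward pass sees the unmasked gates, so the
        # router gets a gradient contribution from every expert.
        st_weights = gates - (gates * (1.0 - hard_mask)).detach()
        # The extra cost: one forward pass of every expert on every token. Inactive
        # experts carry a zero weight, so their own parameters receive no gradient;
        # only the router benefits from their outputs.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (T, E, d)
        return torch.einsum("te,ted->td", st_weights, expert_out)
```

In this toy version the only change relative to a standard top-k MoE layer is the straight-through weighting; the sparse forward output itself is untouched.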
Take a look at Mirage, our new work on a real-time neural game engine!
A differentiable, semi-full-param training method for MoEs. More FLOPs, comparable speed, higher and more predictable perf gains. I think it's a big deal. Academic research is stuck with finetuning dense models due to MoE complexity overhead, and that's NGMI in the inference-dominant era.
OctoThinker tech report is finally out! We also release the 70B-token, math-focused mid-training dataset, MegaMath-Web-Pro-Max. Hope you'll find it useful! hf.co/datasets/OctoT… huggingface.co/papers/2506.20…
Say hi to OctoThinker, our new mid-training effort for building strong reasoning base models tailored for the RL scaling era. Still a WIP, but we're excited to share our early insights into rethinking base model development. Blog: tinyurl.com/OctoThinker Huggingface:…