Yufan Zhuang
@yufan_zhuang
ai researcher | research intern @Apple siri | phd student @UCSanDiego | prev @AMD @Meta @MSFTResearch @IBMResearch
🤯Your LLM just threw away 99.9 % of what it knows. Standard decoding samples one token at a time and discards the rest of the probability mass. Mixture of Inputs (MoI) rescues that lost information, feeding it back for more nuanced expressions. It is a brand new…
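For intuition, here is a hedged sketch of the "keep the distribution" idea the tweet describes: instead of feeding back only the embedding of the sampled token, feed a blend of that embedding and the probability-weighted average of all token embeddings. The fixed `beta` mixing weight below is a placeholder; the actual MoI paper estimates the mixture differently, so treat this as an illustration rather than the method itself.

```python
# Hedged sketch of mixing the full output distribution back into the next-step input.
# `beta` is a made-up fixed blend weight; MoI's real weighting scheme is in the paper.
import torch

def moi_step_input(logits, sampled_id, embedding, beta=0.5, temperature=1.0):
    """logits: (vocab,) scores for this step; sampled_id: the token actually drawn;
    embedding: (vocab, d_model) token embedding matrix."""
    probs = torch.softmax(logits / temperature, dim=-1)       # the full probability mass
    expected_emb = probs @ embedding                          # E[embedding] under the model
    sampled_emb = embedding[sampled_id]                       # standard decoding keeps only this
    return beta * sampled_emb + (1.0 - beta) * expected_emb   # "mixture of inputs"

# toy usage with made-up sizes
vocab, d_model = 100, 16
emb, logits = torch.randn(vocab, d_model), torch.randn(vocab)
tok = int(torch.multinomial(torch.softmax(logits, -1), 1))
print(moi_step_input(logits, tok, emb).shape)   # torch.Size([16])
```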

Muon still seems more unstable than AdamW; it needs a soft-clipping trick, similar to qk-norm, to work. While scaling up, we encountered a persistent challenge: training instability caused by exploding attention logits, an issue that occurs more frequently with Muon but less…
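Two generic ways to tame attention logits, sketched for illustration (tanh soft-capping as popularized by Gemma-2-style logit capping, and qk-norm); this is not necessarily the exact mechanism the quoted report uses.

```python
# Illustrative only: generic logit soft-capping and QK-norm, not the quoted report's code.
import torch
import torch.nn.functional as F

def softcap(logits, cap=50.0):
    # smoothly bounds logits to (-cap, cap) instead of letting them explode
    return cap * torch.tanh(logits / cap)

def qk_norm_attention(q, k, v, scale=10.0):
    # QK-norm: L2-normalize queries and keys so logits are bounded by `scale`
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = scale * (q @ k.transpose(-2, -1))     # each entry lies in [-scale, scale]
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(2, 8, 16)                  # (batch, seq, head_dim)
print(softcap(torch.tensor([3.0, 300.0])))         # the huge logit is squashed toward 50
print(qk_norm_attention(q, k, v).shape)            # torch.Size([2, 8, 16])
```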
🚨1T LLM released by Moonshot AI, the largest yet. Kimi-K2: a 384-expert MoE with 8 routed experts plus 1 shared expert per token, ~32B active parameters. Context length 128K, MLA attention. Outperforms DeepSeek-V3 at nearly twice its size.
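To make the "8 routed experts + 1 shared expert" spec concrete, here is a toy sketch of top-k MoE routing with an always-on shared expert. Dimensions and expert counts are shrunk and made up; this is not Kimi-K2's implementation.

```python
# Toy MoE with a router, top-k gating, and a shared expert (illustrative, not Kimi-K2 code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=32, d_ff=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # the shared expert runs on every token regardless of routing
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, n_experts)
        topw, topi = scores.topk(self.top_k, dim=-1)
        # sparse gates: softmax over the selected experts, zero everywhere else
        gates = torch.zeros_like(scores).scatter(-1, topi, F.softmax(topw, dim=-1))
        out = self.shared(x)                                 # shared expert contribution
        for e, expert in enumerate(self.experts):
            g = gates[:, e : e + 1]
            if g.any():                                      # real systems dispatch only routed
                out = out + g * expert(x)                    # tokens; this toy runs densely
        return out

moe = ToyMoE()
print(moe(torch.randn(5, 32)).shape)   # torch.Size([5, 32])
```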
We have a full-time position for a research scientist on our team at #Apple. The topic is understanding and improving #reasoning abilities of #LLMs. We're also interested in developing new, efficient transformer-based architectures for language modeling, again with reasoning…
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,…
Can AI file your taxes? Not yet. 😅 interesting benchmarks arxiv.org/pdf/2507.16126

Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄 huggingface.co/papers/2507.18…
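For readers curious what "sequence policy optimization" means in practice, below is my paraphrase of the core GSPO objective; see the linked report for the exact formulation. The key difference from token-level PPO/GRPO ratios is a length-normalized sequence-level importance ratio, with clipping applied per sequence.

```python
# Sketch of a sequence-level clipped surrogate in the spirit of GSPO (paraphrase, not the
# official implementation). One group-relative advantage per response; one ratio per sequence.
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """logp_new, logp_old: (batch, seq) per-token log-probs under current / behavior policy
    advantages: (batch,) one advantage per response; mask: (batch, seq) 1 on response tokens."""
    lengths = mask.sum(dim=-1).clamp(min=1)
    # sequence ratio = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), computed in log space
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    # maximize the clipped surrogate -> minimize its negation
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

# toy call just to show the shapes
lp_new, lp_old = torch.randn(4, 6), torch.randn(4, 6)
adv, m = torch.randn(4), torch.ones(4, 6)
print(gspo_loss(lp_new, lp_old, adv, m))
```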
I'm hiring for our AI safety team at xAI! We urgently need strong engineers/researchers to work across all stages of the frontier AI development cycle: data, training, evals, and product. 1. job-boards.greenhouse.io/xai/jobs/47992… 2. job-boards.greenhouse.io/xai/jobs/47992…
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
In this report we describe the 2025 Apple Foundation Models ("AFM"). We also introduce the new Foundation Models framework, which gives app developers direct access to the on-device AFM model. machinelearning.apple.com/research/apple…
Just arrived at ICML. Please drop me a message if you are here and would like to chat. We are hiring.
Excited to present our #ICML2025 paper ActionPiece this *Thursday afternoon*! Come to our poster and let's chat about #Tokenization and Recommendation! 🗓 Thu, July 17 | 🕟 4:30–7:00 PM PDT 📍 East Exhibition Hall A-B, Poster # E-2209
Did you know tokenization for generative recommendation today looks a lot like LLM tokenization did *10 years* ago? Meet ActionPiece, our #ICML2025 Spotlight paper, the first context-aware action tokenizer. 1/5 🧵
Big news: we've figured out how to make a *universal* reward function that lets you apply RL to any agent with: - no labeled data - no hand-crafted reward functions - no human feedback! A 🧵 on RULER
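A rough sketch of the general idea as I read the tweet (not RULER's actual code): an LLM judge scores a group of candidate trajectories relative to each other, and those relative scores become the RL rewards, with no labels, hand-written reward function, or human feedback. The judge model name and prompt below are placeholders, and the sketch assumes the judge replies with clean JSON.

```python
# Hedged sketch of group-relative, judge-based rewards; placeholder model and prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def relative_rewards(task: str, trajectories: list[str]) -> list[float]:
    numbered = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(trajectories))
    prompt = (
        f"Task: {task}\n\nHere are {len(trajectories)} candidate agent trajectories:\n"
        f"{numbered}\n\n"
        "Score each trajectory from 0 to 1 relative to the others and reply with JSON only, "
        'e.g. {"scores": [0.2, 0.9, ...]}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)["scores"]

# these group-relative scores can then be plugged into a GRPO-style update as rewards
```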
new blog: How to scale RL to 10^26 FLOPs. everyone is trying to figure out the right way to scale reasoning with RL. ilya compared the Internet to fossil fuel: it may be the only useful data we have, and it's expendable. perhaps we should learn to reason from The Internet (not…
Applications are now open for our fall student programs.
The bottleneck in AI isn't just compute - it's access to diverse, high-quality data, much of which is locked away due to privacy, legal, or competitive concerns. What if there was a way to train better models collaboratively, without actually sharing your data? Introducing…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
I've always thought the SSM+Transformer hybrid arch has huge potential, especially for long-context use cases; the speed advantage from SSM is unmatched
Now live. A new update to our Jamba open model family 🎉 Same hybrid SSM-Transformer architecture, 256K context window, efficiency gains & open weights. Now with improved grounding & instruction following. Try it on AI21 Studio or download from @huggingface 🤗 More on what…
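To show what "hybrid SSM-Transformer" means structurally, here is a toy sketch of an interleaved stack: most layers are a minimal gated linear recurrence (O(L) in sequence length, constant-size state), with a standard O(L²) attention layer mixed in every few blocks. The layer ratio and dimensions are made up for illustration; this is not Jamba's architecture or code.

```python
# Toy interleaved SSM/attention stack (illustrative only).
import torch
import torch.nn as nn

class ToySSMBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.in_proj, self.out_proj = nn.Linear(d, d), nn.Linear(d, d)
        self.decay = nn.Parameter(torch.rand(d))          # per-channel state decay

    def forward(self, x):                                  # x: (B, L, d)
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)
        h, out = torch.zeros_like(u[:, 0]), []
        for t in range(u.shape[1]):                        # recurrent scan: O(L), fixed-size state
            h = a * h + (1 - a) * u[:, t]
            out.append(h)
        return x + self.out_proj(torch.stack(out, dim=1))

class ToyAttnBlock(nn.Module):
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        return x + self.attn(x, x, x, need_weights=False)[0]

# e.g. one attention layer for every three SSM layers (ratio chosen arbitrarily here)
d_model = 64
layers = [ToyAttnBlock(d_model) if i % 4 == 3 else ToySSMBlock(d_model) for i in range(8)]
model = nn.Sequential(*layers)
print(model(torch.randn(2, 16, d_model)).shape)   # torch.Size([2, 16, 64])
```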
usually this is called a 128K context window... doesn't matter. surprised grok 4 didn't go for long context as gemini did; it's key to large-scale repo-level coding (aka vibe coding)
Grok 4 is coming this week! Very exciting and with a 132k context window!
😵💫 Struggling with 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐌𝐨𝐄? Meet 𝐃𝐞𝐧𝐬𝐞𝐌𝐢𝐱𝐞𝐫 — an MoE post-training method that offers more 𝐩𝐫𝐞𝐜𝐢𝐬𝐞 𝐫𝐨𝐮𝐭𝐞𝐫 𝐠𝐫𝐚𝐝𝐢𝐞𝐧𝐭, making MoE 𝐞𝐚𝐬𝐢𝐞𝐫 𝐭𝐨 𝐭𝐫𝐚𝐢𝐧 and 𝐛𝐞𝐭𝐭𝐞𝐫 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐢𝐧𝐠! Blog: fengyao.notion.site/moe-posttraini……
gonna come in handy for all the “implement kv-cache for me” interviews 😂
Since it's summer, and more or less internship and tech interview season, I made all 30 chapters of my Machine Learning Q and AI book freely available for the summer: sebastianraschka.com/books/ml-q-and… Hope it’s helpful! Happy reading, and good luck if you are interviewing!
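On the "implement kv-cache for me" interview question mentioned above: a minimal single-head sketch of what a KV cache buys you. Keys and values for past tokens are stored once, so each decoding step only computes attention for the newest token instead of re-running the whole prefix. Whiteboard-level illustration, not production code.

```python
# Minimal single-head KV-cache sketch.
import torch

class KVCache:
    def __init__(self):
        self.k = None   # (1, tokens_so_far, d)
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

def decode_step(x_new, wq, wk, wv, cache):
    # x_new: (1, 1, d) hidden state of just the newest token
    q, k_new, v_new = x_new @ wq, x_new @ wk, x_new @ wv
    k, v = cache.append(k_new, v_new)                               # all keys/values so far
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                                 # (1, 1, d)

d = 16
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
cache = KVCache()
for _ in range(5):                                                  # O(1) new work per step
    out = decode_step(torch.randn(1, 1, d), wq, wk, wv, cache)
print(out.shape, cache.k.shape)   # torch.Size([1, 1, 16]) torch.Size([1, 5, 16])
```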