Zhipeng Huang
@nopainkiller
Open Source Effectively Accelerates!
The first blog deals with compression, which here largely refers to artificial intelligence. I share eight high-level trends that I think the field is heading toward post-2025 to make bioverse compression doable. Read the blog here: progverse.substack.com/p/towards-an-e… 3/n🧵
Group Sequence Policy Optimization, from the Qwen team. A reinforcement learning algorithm that defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. Links below.
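A minimal sketch of the idea, as I read it (function and variable names are mine, and the length-normalized geometric-mean ratio is my reading of the GSPO formulation, not code from the release):

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Sketch of a GSPO-style sequence-level objective.

    logp_new, logp_old: (batch, seq) per-token log-probs under the current
    and behavior policies; mask: (batch, seq) 1 for response tokens.
    advantages: (batch,) group-normalized sequence rewards.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Sequence-level likelihood ratio, length-normalized
    # (geometric mean of token ratios) rather than per-token ratios.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # Clipping is applied to the whole sequence, not token by token.
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```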
Really decent work! I actually feel bad that so few people pay attention to this one.
Why do FFNs use ReLU instead of more precise kernels like Exp? "We propose the following hypothesis: A kernel with lower retrieval precision encourages a more polysemantic key–value memory: multiple unrelated facts can be stored under the same key space." Great and inspiring read!
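To make the key-value-memory reading concrete, here is a toy illustration of the hypothesis (entirely my own sketch, not the paper's code): view the FFN as retrieval, FFN(x) = sum_i phi(x·k_i) v_i, where a sharp exp kernel concentrates on one key while ReLU lets several unrelated keys fire at once.

```python
import numpy as np

def ffn_as_memory(x, K, V, kernel="relu"):
    """Toy view of an FFN as key-value retrieval: FFN(x) = sum_i phi(x . k_i) v_i."""
    scores = K @ x                       # match the query x against every stored key
    if kernel == "relu":
        weights = np.maximum(scores, 0)  # low retrieval precision: many keys can fire together
    else:                                # "exp"
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()         # high precision: mass concentrates on the best key
    return weights @ V                   # blend the stored values
```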
finally out!
Phase 1 of the Physics of Language Models code release ✅ our Part 3.1 + 4.1 = all you need to pretrain a strong 8B base model in 42k GPU-hours ✅ Canon layers = strong, scalable gains ✅ Real open source (data/train/weights) ✅ Apache 2.0 license (commercial ok!) 🔗github.com/facebookresear…
The new qwen3-235b-a22b-thinking scores 38.9% on WeirdML, putting it between flash-2.5 and grok-3-mini. That makes it the third very solid qwen-3 model released in a week or so, all of them basically at the frontier for their cost.
WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6) and results from all the latest models. We now also track API costs and other metadata, which gives more insight into the different models. The new results are shown in these two…
Talked to one of the RL leads on S1 yesterday; a lot of effort went into multi-agent RL, and they tested internally on IMO (great results, but they can't disclose them).
𝐍𝐄𝐖 𝐈𝐍𝐓𝐄𝐑𝐍 𝐌𝐎𝐃𝐄𝐋 Intern-S1 is a 241B open multimodal model composed of a 235B MoE language model and a 6B vision encoder, trained on 5T tokens with over half from scientific domains. It supports unified text, image, and video reasoning, features a dynamic tokenizer…
Some ideas on how one could handle mildly superhumanly persuasive AI systems: niplav.site/persuasion
you might find this one interesting as well, Sholto x.com/adamlewisgreen…
If you're unfamiliar with biology and want to get up to speed, my earlier essay provides a whirlwind tour of progress in the field over the past few decades, with an eye toward what role machine learning will play in its future. markovbio.github.io/biomedical-pro…
This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA for their 321B-A38B MoE model Step3, served on H800! The implementation is based on vLLM, and we are working…
Here is what people mean by "residual networks make your gradients happy" + intuition on depth muP. Gradients vanish if activations are not unit-scaled. But that's not an issue if you are using residual connections! But if you don't scale down the branch, your activations / backward pass blow…
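A minimal sketch of the branch-scaling point (module name is mine, and the 1/sqrt(depth) factor is one common depth-muP-style choice, not necessarily the thread's exact recipe):

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block whose branch is scaled down so activations stay ~unit-scale as depth grows."""
    def __init__(self, dim, depth):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.scale = depth ** -0.5  # without this, activation / gradient norms grow with depth

    def forward(self, x):
        # The identity path keeps gradients flowing cleanly;
        # the scaled branch adds only a small perturbation per layer.
        return x + self.scale * self.branch(x)
```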
Read the full post here: tilderesearch.com/blog/momoe Mo' Money, Mo(E)' Problems
Scaling Laws for Efficient MoEs arxiv.org/pdf/2507.17702 EL (Efficiency Leverage) is the ratio of the computational cost of a dense model to that of its loss/performance-equivalent MoE counterpart. Example: EL = 5 -> the MoE needs only 1/5th of the FLOPs to reach the same loss. - Sparsity…
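Spelled out (my paraphrase of the definition, with $C_{\text{dense}}$ and $C_{\text{MoE}}$ the compute each model needs to reach the same loss $L$):

```latex
\mathrm{EL}(L) \;=\; \frac{C_{\text{dense}}(L)}{C_{\text{MoE}}(L)},
\qquad \mathrm{EL} = 5 \;\Rightarrow\; C_{\text{MoE}} = \tfrac{1}{5}\, C_{\text{dense}}.
```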
Why do transformers excessively attend to the first token? Our new hypothesis: it's the model's way to perform a context-dependent identity operation. We provide strong empirical evidence supporting this explanation. Read more here: quilted-agreement-28c.notion.site/Why-Does-Atten… #AI #NLP #Transformer
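A toy numerical sketch of that hypothesis (entirely my illustration, not the post's code): if a head dumps almost all of its attention on a first token whose value vector is ~zero, the head's output adds ~nothing to the residual stream, so the layer effectively acts as an identity.

```python
import numpy as np

np.random.seed(0)
seq, d = 8, 16
V = np.random.randn(seq, d)
V[0] = 0.0                                  # first-token value carries ~no information
attn = np.full(seq, 0.01)
attn[0] = 1 - attn[1:].sum()                # ~93% of the attention mass sits on token 0
head_out = attn @ V                         # attention output for one query position
print(np.linalg.norm(head_out))             # tiny -> residual stream passes through ~unchanged
```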
"lack of novelty" reviews are harmful. Authors are afraid of "lack of novelty" and avoid discussion. FOG use the same post-norm + QKNorm recipe as OLMo 2 to stablize training, but don't discuss it -- only a table in appendix. (Note: they do discuss OP) They could have been frank.
A black hole simulator by qwen3-coder; unsure about the physics behind it, but it looks pretty cool.
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…