Zhipeng Huang
@nopainkiller
Open Source Effectively Accelerates!
The first blog deals with compression, which here largely refers to artificial intelligence. I share eight high-level trends that I think the field is heading toward post-2025 to make bioverse compression doable. Read the blog here: progverse.substack.com/p/towards-an-e… 3/n🧵
Group Sequence Policy Optimization, from the Qwen team. A reinforcement learning algorithm that defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. Links below.
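A minimal sketch of the idea, as I read it (function and variable names are mine, and the length-normalized geometric-mean ratio is my reading of the GSPO formulation, not code from the release):

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Sketch of a GSPO-style sequence-level objective.

    logp_new, logp_old: (batch, seq) per-token log-probs under the current
    and behavior policies; mask: (batch, seq) 1 for response tokens.
    advantages: (batch,) group-normalized sequence rewards.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Sequence-level likelihood ratio, length-normalized
    # (geometric mean of token ratios) rather than per-token ratios.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # Clipping is applied to the whole sequence, not token by token.
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```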
Really decent work! I actually feel bad that so few people pay attention to this one.
Why do FFNs use ReLU instead of more precise kernels like Exp? "We propose the following hypothesis: A kernel with lower retrieval precision encourages a more polysemantic key–value memory: multiple unrelated facts can be stored under the same key space." Great and inspiring read!
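To make the key-value-memory reading concrete, here is a toy illustration of the hypothesis (entirely my own sketch, not the paper's code): view the FFN as retrieval, FFN(x) = sum_i phi(x·k_i) v_i, where a sharp exp kernel concentrates on one key while ReLU lets several unrelated keys fire at once.

```python
import numpy as np

def ffn_as_memory(x, K, V, kernel="relu"):
    """Toy view of an FFN as key-value retrieval: FFN(x) = sum_i phi(x . k_i) v_i."""
    scores = K @ x                       # match the query x against every stored key
    if kernel == "relu":
        weights = np.maximum(scores, 0)  # low retrieval precision: many keys can fire together
    else:                                # "exp"
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()         # high precision: mass concentrates on the best key
    return weights @ V                   # blend the stored values
```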
finally out!
Phase 1 of the Physics of Language Models code release ✅ our Part 3.1 + 4.1 = all you need to pretrain a strong 8B base model in 42k GPU-hours ✅ Canon layers = strong, scalable gains ✅ Real open source (data/train/weights) ✅ Apache 2.0 license (commercial ok!) 🔗github.com/facebookresear…
The new qwen3-235b-a22b-thinking scores 38.9% on WeirdML, putting it between flash-2.5 and grok-3-mini. That makes it the third very solid qwen-3 model released in a week or so, all of them basically at the frontier for their cost.
WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6) and results from all the latest models. We now also track API costs and other metadata, which gives more insight into the different models. The new results are shown in these two…
Talked to one of the RL leads on S1 yesterday; a lot of effort went into multi-agent RL, and they tested internally on IMO (great results, but they can't disclose them).
𝐍𝐄𝐖 𝐈𝐍𝐓𝐄𝐑𝐍 𝐌𝐎𝐃𝐄𝐋 Intern-S1 is a 241B open multimodal model composed of a 235B MoE language model and a 6B vision encoder, trained on 5T tokens with over half from scientific domains. It supports unified text, image, and video reasoning, features a dynamic tokenizer…
Some ideas on how one could handle mildly superhumanly persuasive AI systems: niplav.site/persuasion
you might find this one interesting as well, Sholto x.com/adamlewisgreen…
If you're unfamiliar with biology and want to get up to speed, my earlier essay provides a whirlwind tour of progress in the field over the past few decades, with an eye toward what role machine learning will play in its future. markovbio.github.io/biomedical-pro…
This amazing Attention-FFN disaggregation implementation from @StepFun_ai achieves decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA for their 321B-A38B MoE model Step3, served on H800! The implementation is based on vLLM, and we are working…
Here is what people mean by "residual networks make your gradients happy" + intuition on depth muP. Gradients vanish if activations are not unit-scaled. But that's not an issue if you are using residual connections! But if you don't scale down the branch, your activations / backward pass blow…
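A minimal sketch of the branch-scaling point (module name is mine, and the 1/sqrt(depth) factor is one common depth-muP-style choice, not necessarily the thread's exact recipe):

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block whose branch is scaled down so activations stay ~unit-scale as depth grows."""
    def __init__(self, dim, depth):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.scale = depth ** -0.5  # without this, activation / gradient norms grow with depth

    def forward(self, x):
        # The identity path keeps gradients flowing cleanly;
        # the scaled branch adds only a small perturbation per layer.
        return x + self.scale * self.branch(x)
```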
Read the full post here: tilderesearch.com/blog/momoe Mo' Money, Mo(E)' Problems
Scaling Laws for Efficient MoEs arxiv.org/pdf/2507.17702 EL (Efficiency Leverage) is the ratio of the computational cost of a dense model to that of its loss/performance-equivalent MoE counterpart. Example: EL = 5 -> the MoE needs only 1/5th of the FLOPs to reach the same loss. - Sparsity…
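Spelled out (my paraphrase of the definition, with $C_{\text{dense}}$ and $C_{\text{MoE}}$ the compute each model needs to reach the same loss $L$):

```latex
\mathrm{EL}(L) \;=\; \frac{C_{\text{dense}}(L)}{C_{\text{MoE}}(L)},
\qquad \mathrm{EL} = 5 \;\Rightarrow\; C_{\text{MoE}} = \tfrac{1}{5}\, C_{\text{dense}}.
```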
Why do transformers excessively attend to the first token? Our new hypothesis: it's the model's way to perform a context-dependent identity operation. We provide strong empirical evidence supporting this explanation. Read more here: quilted-agreement-28c.notion.site/Why-Does-Atten… #AI #NLP #Transformer
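A toy numerical sketch of that hypothesis (entirely my illustration, not the post's code): if a head dumps almost all of its attention on a first token whose value vector is ~zero, the head's output adds ~nothing to the residual stream, so the layer effectively acts as an identity.

```python
import numpy as np

np.random.seed(0)
seq, d = 8, 16
V = np.random.randn(seq, d)
V[0] = 0.0                                  # first-token value carries ~no information
attn = np.full(seq, 0.01)
attn[0] = 1 - attn[1:].sum()                # ~93% of the attention mass sits on token 0
head_out = attn @ V                         # attention output for one query position
print(np.linalg.norm(head_out))             # tiny -> residual stream passes through ~unchanged
```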
"lack of novelty" reviews are harmful. Authors are afraid of "lack of novelty" and avoid discussion. FOG use the same post-norm + QKNorm recipe as OLMo 2 to stablize training, but don't discuss it -- only a table in appendix. (Note: they do discuss OP) They could have been frank.
A black hole simulator by qwen3-coder; unsure about the physics behind it, but it looks pretty cool.
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…