Fan Zhou
@FaZhou_998
OctoThinker / MegaMath / ProX. PhD Student at SJTU. Prev: Core member @XLangNLP, Intern @MSFTResearch.
OctoThinker tech report is finally out! We also release the 70B-token math-focused mid-training dataset, MegaMath-Web-Pro-Max. Hope you'll find it useful! hf.co/datasets/OctoT… huggingface.co/papers/2506.20…
Say hi to OctoThinker, our new mid-training effort for building strong reasoning base models tailored for the RL scaling era. Still a WIP, but we're excited to share our early insights into rethinking base model development. Blog: tinyurl.com/OctoThinker Huggingface:…
MegaMath has been accepted to @COLM_conf 2025! Hope you find our data useful!
Happy to share our latest effort on math pre-training data, the MegaMath dataset! This is a 9-month project starting from the summer of 2024, and we finally deliver the largest math pre-training dataset to date: 370B tokens of web, code, and synthetic data!
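For anyone who wants to poke at the data, here is a minimal sketch of streaming a few MegaMath documents from the Hugging Face Hub. The repo id, config name, and "text" field below are assumptions on my part; check the dataset card linked above for the exact names.

```python
# Minimal sketch of streaming a few MegaMath documents from the Hub.
# The repo id, config name, and "text" field are assumptions --
# check the dataset card for the exact names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", "web", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["text"][:200])  # peek at the first 200 chars of each document
    if i == 2:
        break
```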
>>> Qwen3-Coder is here! We're releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
After three intense months of hard work with the team, we made it! We hope this release helps drive progress on coding agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
This one is not small! The team spent so much time building Qwen3-Coder after Qwen2.5-Coder. It is much bigger, but based on MoE, and way stronger and smarter than before! Not sure we can say it is competitive with Claude Sonnet 4, but it should for sure be a really good coding agent.…
Apart from the performance, it's pure entertainment just watching Qwen3-Coder build Qwen Code all by itself. Agentic coding is really something: it explores, understands, plans, and acts seamlessly. Honored to be "in the game", even if my entire work so far is smashing the Enter…
Excited to share DreamOn, our latest work teaching diffusion LMs to dynamically expand and contract beyond fixed-size canvases!
We present DreamOn: a simple yet effective method for variable-length generation in diffusion language models. Our approach boosts code infilling performance significantly and even catches up with oracle results.
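As I read the announcement, the core mechanism is that a masked slot can resolve to a real token, split into more masks, or disappear, so the canvas length changes between denoising steps. A toy sketch of that loop, with hypothetical token names and a stand-in predict() function (not DreamOn's actual interface):

```python
# Toy sketch: a masked slot can resolve to a real token, split into more
# masks (<expand>), or vanish (<delete>), so the canvas length changes
# between denoising steps. Token names and predict() are hypothetical.
MASK, EXPAND, DELETE = "<mask>", "<expand>", "<delete>"

def denoise_step(canvas, predict):
    new_canvas = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            new_canvas.append(tok)      # already-decoded tokens are kept
        else:
            pred = predict(canvas, i)   # model's prediction for slot i
            if pred == EXPAND:
                new_canvas += [MASK, MASK]   # grow: one mask becomes two
            elif pred == DELETE:
                pass                         # shrink: drop the slot
            else:
                new_canvas.append(pred)      # commit a real token
    return new_canvas
```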
Really appreciate this thread! One perspective from RL for LLM reasoning: domains with heavier pretraining exposure (e.g., math, code, science) tend to show significantly stronger cross-domain transfer than less-seen ones (like logic, simulation, tabular). More context here if of…
Attending #ICML2025 this week! Will be presenting Aguvis (arxiv.org/abs/2412.04454) on July 15 at 11am, and joining the Computer Use Agent Workshop @workshopcua on July 19. If you're into digital agent research, especially around computer/browser use, let's grab a coffee!
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
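A sketch of how the dual think/no_think mode might be toggled from transformers, assuming a Qwen3-style `enable_thinking` kwarg on the chat template; the blog post linked above has the authoritative usage.

```python
# Sketch of toggling SmolLM3's dual reasoning mode from transformers.
# The `enable_thinking` chat-template kwarg is an assumption (Qwen3-style);
# see huggingface.co/blog/smollm3 for the authoritative switch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23?"}]
prompt = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=True,  # False -> no_think mode, skips the reasoning trace
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```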
Can diffusion models write code competitively? Excited to share our latest 7B coding diffusion LLM!! With DiffuCoder, we explore how they decode, why temperature matters, and how to improve them via coupled-GRPO that speaks diffusion!! Code: github.com/apple/ml-diffu…
Struggling with fine-tuning MoE? Meet DenseMixer, an MoE post-training method that offers a more precise router gradient, making MoE easier to train and better performing! Blog: fengyao.notion.site/moe-posttraini…
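For intuition, here is a minimal sketch of one way to get a denser router gradient: the forward pass uses the sparse top-k mixing weights, while a straight-through trick lets the backward pass see the full softmax, so the router receives gradient signal for all experts. This is my reading of the general idea, not necessarily DenseMixer's exact formulation; see the blog for the real method.

```python
# Sketch of a denser router gradient via a straight-through estimator:
# forward value uses sparse top-k weights, backward sees the dense softmax.
# My reading of the general idea, not necessarily DenseMixer's exact method.
import torch

def moe_forward(x, router, experts, k=2):
    probs = torch.softmax(router(x), dim=-1)              # [batch, n_experts]
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    sparse = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
    # Forward value == sparse; gradient w.r.t. the router flows through probs.
    weights = sparse.detach() + probs - probs.detach()
    expert_out = torch.stack([e(x) for e in experts], dim=1)  # [batch, n, d]
    return torch.einsum("be,bed->bd", weights, expert_out)
```

For clarity the sketch runs every expert; real MoE kernels execute only the selected top-k experts in the forward pass.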
What foundation models do we REALLY need for the RL era? And what pre-training data? Excited to share our work: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling arxiv.org/pdf/2506.20512 Key breakthroughs: - First RL-focused mid-training approach - Llama…
What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries": (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?…
We have finally released the paper for FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
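A toy sketch of what deduplication-based upsampling could look like: after dedup keeps one representative per duplicate cluster, that representative is repeated as a capped function of the cluster's size (widely-duplicated pages are often higher quality). The cap and weighting here are illustrative; the paper defines the actual scheme.

```python
# Toy sketch of "rehydration" as I read it: dedup keeps one representative
# per duplicate cluster, then repeats it as a capped function of cluster
# size. The cap and weighting are illustrative, not FineWeb2's exact scheme.
def rehydrate(kept_docs, cluster_sizes, cap=5):
    """kept_docs[i] represents a cluster that had cluster_sizes[i] copies."""
    out = []
    for doc, n_copies in zip(kept_docs, cluster_sizes):
        out.extend([doc] * min(n_copies, cap))  # upsample, but cap it
    return out

corpus = rehydrate(["doc_a", "doc_b"], [1, 12], cap=5)
# -> doc_a appears once, doc_b five times
```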
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. Apple introduces DiffuCoder, a 7B diffusion LLM trained on 130B tokens of code. The authors also propose a diffusion-native RL training framework, coupled-GRPO. Decoding of dLLMs differs from…
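To make the "how they decode, why temperature matters" point concrete, here is a toy confidence-based parallel decoding step for a masked dLLM; in this regime, temperature shapes not only which token a slot gets but also the order in which slots are committed. `logits_fn` and the commit rule are stand-ins, not DiffuCoder's real API.

```python
# Toy confidence-based parallel decoding step for a masked dLLM: sample
# every masked position at temperature T, then commit only the most
# confident ones. `logits_fn` is a stand-in, not DiffuCoder's real API.
import torch

def decode_step(tokens, mask_id, logits_fn, temperature=0.8, n_commit=4):
    logits = logits_fn(tokens)                        # [seq_len, vocab]
    probs = torch.softmax(logits / temperature, dim=-1)
    masked = (tokens == mask_id).nonzero().squeeze(-1)
    samples = torch.multinomial(probs[masked], 1).squeeze(-1)
    conf = probs[masked, samples]                     # confidence per sample
    keep = conf.topk(min(n_commit, len(masked))).indices
    tokens = tokens.clone()
    tokens[masked[keep]] = samples[keep]              # commit most confident
    return tokens
```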
Say hi to MegaMath-Pro-Max. High-quality corpora are vital for mid-training; what does that take in the math domain? Let me tell you the recipe behind it. 1. Curation pipeline. Step 1: uniformly and randomly sample millions of documents from the MegaMath-Web corpus, stratified by…
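Step 1 in code, roughly: draw a uniform random sample per stratum. The stratification key is truncated in the post, so `by` below is a hypothetical placeholder.

```python
# Step 1, roughly: a uniform random sample per stratum. The stratification
# key is truncated in the post, so `by` is a hypothetical placeholder.
import random
from collections import defaultdict

def stratified_sample(docs, by, per_stratum=10_000, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[by(doc)].append(doc)          # bucket docs by stratum key
    sample = []
    for bucket in strata.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```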
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling. "we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama." "we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base…
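A sketch of what a Stable-then-Decay learning-rate schedule looks like: hold the LR flat for most of mid-training, then decay it in a final stage. The split fraction, peak/min LRs, and cosine shape are illustrative assumptions, not the paper's exact values.

```python
# Sketch of a Stable-then-Decay LR schedule: flat LR for most of
# mid-training, then decay. The split fraction, peak/min LRs, and cosine
# shape are illustrative assumptions, not the paper's exact values.
import math

def stable_then_decay_lr(step, total_steps, stable_frac=0.9,
                         peak_lr=3e-4, min_lr=3e-5):
    stable_steps = int(total_steps * stable_frac)
    if step < stable_steps:
        return peak_lr  # stage 1: stable, constant LR
    # stage 2: cosine-decay from peak_lr down to min_lr
    t = (step - stable_steps) / max(1, total_steps - stable_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```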
4B open-recipe model beats Claude-4-Opus. 100% open data, recipe, model weights, and code. Introducing Polaris, a post-training recipe for scaling RL on advanced reasoning models. Check out how we boost open-recipe reasoning models to incredible performance levels…