Fan Zhou
@FaZhou_998
OctoThinker / MegaMath / ProX. PhD Student at SJTU. Prev: Core member @XLangNLP, Intern @MSFTResearch.
OctoThinker tech report is finally out! We also release the 70B-token math-focused mid-training dataset, MegaMath-Web-Pro-Max. Hope you'll find it useful! hf.co/datasets/OctoT… huggingface.co/papers/2506.20…
Say hi to OctoThinker, our new mid-training effort for building strong reasoning base models tailored for the RL scaling era. Still a WIP, but we're excited to share our early insights into rethinking base model development. Blog: tinyurl.com/OctoThinker Huggingface:…
MegaMath has been accepted to @COLM_conf 2025! Hope you find our data useful!
Happy to share our latest effort on math pre-training data, the MegaMath dataset! This is a 9-month project starting from the summer of 2024, and we finally deliver the largest math pre-training dataset to date: 370B tokens of web, code, and synthetic data!
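For anyone who wants to poke at the data, here is a minimal sketch of streaming a few MegaMath documents from the Hugging Face Hub. The repo id, config name, and "text" field below are assumptions on my part; check the dataset card linked above for the exact names.

```python
# Minimal sketch of streaming a few MegaMath documents from the Hub.
# The repo id, config name, and "text" field are assumptions --
# check the dataset card for the exact names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", "web", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["text"][:200])  # peek at the first 200 chars of each document
    if i == 2:
        break
```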
>>> Qwen3-Coder is here! We're releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
After three intense months of hard work with the team, we made it! We hope this release helps drive progress on coding agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
This one is not small! The team spent so much time building Qwen3-Coder after Qwen2.5-Coder. It is much bigger, but based on MoE, and way stronger and smarter than before! Not sure we can say it is competitive with Claude Sonnet 4, but it should for sure be a really good coding agent.…
Apart from the performance, it's pure entertainment just watching Qwen3-Coder build Qwen Code all by itself. Agentic coding is really something: it explores, understands, plans, and acts seamlessly. Honored to be "in the game", even if my entire work so far is smashing the Enter…
Excited to share DreamOn, our latest work teaching diffusion LMs to dynamically expand and contract beyond fixed-size canvases!
We present DreamOn: a simple yet effective method for variable-length generation in diffusion language models. Our approach boosts code infilling performance significantly and even catches up with oracle results.
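As I read the announcement, the core mechanism is that a masked slot can resolve to a real token, split into more masks, or disappear, so the canvas length changes between denoising steps. A toy sketch of that loop, with hypothetical token names and a stand-in predict() function (not DreamOn's actual interface):

```python
# Toy sketch: a masked slot can resolve to a real token, split into more
# masks (<expand>), or vanish (<delete>), so the canvas length changes
# between denoising steps. Token names and predict() are hypothetical.
MASK, EXPAND, DELETE = "<mask>", "<expand>", "<delete>"

def denoise_step(canvas, predict):
    new_canvas = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            new_canvas.append(tok)      # already-decoded tokens are kept
        else:
            pred = predict(canvas, i)   # model's prediction for slot i
            if pred == EXPAND:
                new_canvas += [MASK, MASK]   # grow: one mask becomes two
            elif pred == DELETE:
                pass                         # shrink: drop the slot
            else:
                new_canvas.append(pred)      # commit a real token
    return new_canvas
```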
Really appreciate this thread! One perspective from RL for LLM reasoning: domains with heavier pretraining exposure (e.g., math, code, science) tend to show significantly stronger cross-domain transfer than less-seen ones (like logic, simulation, tabular). More context here if of…
Attending #ICML2025 this week! Will be presenting Aguvis (arxiv.org/abs/2412.04454) on July 15 at 11am, and joining the Computer Use Agent Workshop @workshopcua on July 19. If you're into digital agent research, especially around computer/browser use, let's grab a coffee!
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) huggingface.co/blog/smollm3
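A sketch of how the dual think/no_think mode might be toggled from transformers, assuming a Qwen3-style `enable_thinking` kwarg on the chat template; the blog post linked above has the authoritative usage.

```python
# Sketch of toggling SmolLM3's dual reasoning mode from transformers.
# The `enable_thinking` chat-template kwarg is an assumption (Qwen3-style);
# see huggingface.co/blog/smollm3 for the authoritative switch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23?"}]
prompt = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=True,  # False -> no_think mode, skips the reasoning trace
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```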
Can diffusion models write code competitively? Excited to share our latest 7B coding diffusion LLM!! With DiffuCoder, we explore how they decode, why temperature matters, and how to improve them via coupled-GRPO that speaks diffusion!! Code: github.com/apple/ml-diffu…
Struggling with fine-tuning MoE? Meet DenseMixer, an MoE post-training method that offers a more precise router gradient, making MoE easier to train and better performing! Blog: fengyao.notion.site/moe-posttraini…
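For intuition, here is a minimal sketch of one way to get a denser router gradient: the forward pass uses the sparse top-k mixing weights, while a straight-through trick lets the backward pass see the full softmax, so the router receives gradient signal for all experts. This is my reading of the general idea, not necessarily DenseMixer's exact formulation; see the blog for the real method.

```python
# Sketch of a denser router gradient via a straight-through estimator:
# forward value uses sparse top-k weights, backward sees the dense softmax.
# My reading of the general idea, not necessarily DenseMixer's exact method.
import torch

def moe_forward(x, router, experts, k=2):
    probs = torch.softmax(router(x), dim=-1)              # [batch, n_experts]
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    sparse = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
    # Forward value == sparse; gradient w.r.t. the router flows through probs.
    weights = sparse.detach() + probs - probs.detach()
    expert_out = torch.stack([e(x) for e in experts], dim=1)  # [batch, n, d]
    return torch.einsum("be,bed->bd", weights, expert_out)
```

For clarity the sketch runs every expert; real MoE kernels execute only the selected top-k experts in the forward pass.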
What foundation models do we REALLY need for the RL era? And what pre-training data? Excited to share our work: OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling arxiv.org/pdf/2506.20512 Key breakthroughs: - First RL-focused mid-training approach - Llama…
What Makes a Base Language Model Suitable for RL? Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries": (1) Is the magic only happening on Qwen + Math? (2) Does the "aha moment" only spark during math reasoning? (3) Is evaluation hiding some tricky traps?…
We have finally released the paper for FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
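A toy sketch of what deduplication-based upsampling could look like: after dedup keeps one representative per duplicate cluster, that representative is repeated as a capped function of the cluster's size (widely-duplicated pages are often higher quality). The cap and weighting here are illustrative; the paper defines the actual scheme.

```python
# Toy sketch of "rehydration" as I read it: dedup keeps one representative
# per duplicate cluster, then repeats it as a capped function of cluster
# size. The cap and weighting are illustrative, not FineWeb2's exact scheme.
def rehydrate(kept_docs, cluster_sizes, cap=5):
    """kept_docs[i] represents a cluster that had cluster_sizes[i] copies."""
    out = []
    for doc, n_copies in zip(kept_docs, cluster_sizes):
        out.extend([doc] * min(n_copies, cap))  # upsample, but cap it
    return out

corpus = rehydrate(["doc_a", "doc_b"], [1, 12], cap=5)
# -> doc_a appears once, doc_b five times
```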
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. Apple introduces DiffuCoder, a 7B diffusion LLM trained on 130B tokens of code. The authors also propose a diffusion-native RL training framework, coupled-GRPO. Decoding of dLLMs differs from…
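To make the "how they decode, why temperature matters" point concrete, here is a toy confidence-based parallel decoding step for a masked dLLM; in this regime, temperature shapes not only which token a slot gets but also the order in which slots are committed. `logits_fn` and the commit rule are stand-ins, not DiffuCoder's real API.

```python
# Toy confidence-based parallel decoding step for a masked dLLM: sample
# every masked position at temperature T, then commit only the most
# confident ones. `logits_fn` is a stand-in, not DiffuCoder's real API.
import torch

def decode_step(tokens, mask_id, logits_fn, temperature=0.8, n_commit=4):
    logits = logits_fn(tokens)                        # [seq_len, vocab]
    probs = torch.softmax(logits / temperature, dim=-1)
    masked = (tokens == mask_id).nonzero().squeeze(-1)
    samples = torch.multinomial(probs[masked], 1).squeeze(-1)
    conf = probs[masked, samples]                     # confidence per sample
    keep = conf.topk(min(n_commit, len(masked))).indices
    tokens = tokens.clone()
    tokens[masked[keep]] = samples[keep]              # commit most confident
    return tokens
```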
Say hi to MegaMath-Pro-Max. High-quality corpora are vital for mid-training; what does that take in the math domain? Let me tell you the recipe behind it. 1. Curation pipeline. Step 1: uniformly and randomly sample millions of documents from the MegaMath-Web corpus, stratified by…
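Step 1 in code, roughly: draw a uniform random sample per stratum. The stratification key is truncated in the post, so `by` below is a hypothetical placeholder.

```python
# Step 1, roughly: a uniform random sample per stratum. The stratification
# key is truncated in the post, so `by` is a hypothetical placeholder.
import random
from collections import defaultdict

def stratified_sample(docs, by, per_stratum=10_000, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[by(doc)].append(doc)          # bucket docs by stratum key
    sample = []
    for bucket in strata.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```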
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling. "we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama." "we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base…
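A sketch of what a Stable-then-Decay learning-rate schedule looks like: hold the LR flat for most of mid-training, then decay it in a final stage. The split fraction, peak/min LRs, and cosine shape are illustrative assumptions, not the paper's exact values.

```python
# Sketch of a Stable-then-Decay LR schedule: flat LR for most of
# mid-training, then decay. The split fraction, peak/min LRs, and cosine
# shape are illustrative assumptions, not the paper's exact values.
import math

def stable_then_decay_lr(step, total_steps, stable_frac=0.9,
                         peak_lr=3e-4, min_lr=3e-5):
    stable_steps = int(total_steps * stable_frac)
    if step < stable_steps:
        return peak_lr  # stage 1: stable, constant LR
    # stage 2: cosine-decay from peak_lr down to min_lr
    t = (step - stable_steps) / max(1, total_steps - stable_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```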
4B open-recipe model beats Claude-4-Opus. 100% open data, recipe, model weights, and code. Introducing Polaris, a post-training recipe for scaling RL on advanced reasoning models. Check out how we boost open-recipe reasoning models to incredible performance levels…