Hanze Dong
@hendrydong
Research Scientist @MSFTResearch. Previously @SFResearch
I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important, as we recently updated (x.com/ypwang61/statu…). But I disagree with some points in the blog regarding 1-shot RLVR. 1. For Deepseek-R1-Distill-Qwen-1.5B, we set…
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important; we also updated the format reward baseline in 1-shot RLVR (x.com/ypwang61/statu…) to report non-format gains more fairly (and we see there is still nontrivial…
We updated our analysis of format correction in 1-shot RLVR, a question we get frequently (thanks for all the feedback!). Summary: (1) Format correction indeed contributes a lot in RLVR (e.g., 18% -> 29% for Qwen2.5-Math-1.5B over 6 math tasks), in both full-set (1.2k data) and one-shot…
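To make the format-vs-correctness split above concrete, here is a minimal sketch (not the 1-shot RLVR grading code) of how a format reward can be separated from a correctness reward when scoring math outputs; `extract_boxed` and the exact-match check are hypothetical simplifications.

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} span, or None if absent.
    Hypothetical helper; real RLVR graders are more robust (nested braces, etc.)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def grade(response: str, gold: str):
    """Split the reward into a format part and a correctness part so that
    format-only gains can be reported separately from true accuracy gains."""
    answer = extract_boxed(response)
    format_reward = 1.0 if answer is not None else 0.0       # followed the \boxed{} convention
    correct_reward = 1.0 if answer == gold.strip() else 0.0  # exact-match correctness
    return format_reward, correct_reward

# Example: a response that is formatted correctly but numerically wrong
print(grade("The answer is \\boxed{42}", "41"))  # -> (1.0, 0.0)
```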
🤔
Introducing: Vibetest-use MCP - automated QA using Browser-Use 🧨 In just one line👀 ⇢ 10+ browser-use agents spin up in parallel ⇢ crawl your dev site in < 60s ⇢ flag every 404, dead button & UI glitch Runs out-of-the-box in Cursor, Claude Code & Codex (soon?). 100% Open…
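The quoted tool drives real browser agents; purely as a rough illustration of the "crawl the site in parallel and flag every 404" idea, here is a plain asyncio + aiohttp sketch. It does not use Browser-Use, MCP, or the tool's actual API, and the localhost URLs are placeholders.

```python
import asyncio
import aiohttp

async def check(session: aiohttp.ClientSession, url: str):
    """Fetch one page and report its status. Simplified: real QA agents also
    click buttons and look for UI glitches, which plain HTTP requests cannot do."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, resp.status
    except aiohttp.ClientError as exc:
        return url, f"error: {exc}"

async def crawl(urls):
    # Run all checks concurrently, loosely mimicking "10+ agents in parallel".
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, u) for u in urls))
    for url, status in results:
        if status != 200:
            print(f"FLAG {url}: {status}")

# Placeholder dev-site URLs
asyncio.run(crawl(["http://localhost:3000/", "http://localhost:3000/about"]))
```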
Excited to bring Qwen3-Coder into the browser and terminal world! Building the scaffolding and environments for this big guy to play and learn is tough but incredibly "rewarding". Agentic coding and browsing are arguably the two most important skills for digital agents: they…
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
GLM-4.1V-Thinking is awesome — delivering groundbreaking performance across benchmarks. 🚀🚀Thrilled to see our Elastic Reasoning strategy recognized and adopted!
GLM-4.1V-Thinking Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
I’m very honored to join this big project. 😆 This long-movie benchmark is a must-have evaluation set if you claim your models are good at video understanding!
🚨Meet MF²: Movie Facts & Fibs: a new benchmark for long-movie understanding! 🤔Do you think your model understands movies? Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗 — things humans recall easily, but even top models like…
Very excited to release our first reasoning model, Magistral. We released the weights of Magistral Small alongside a paper that presents our approach, online RL infrastructure, and findings.
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
Excited to share that EmbodiedBench was selected for an Oral at ICML 2025! We recently added results for new models (InternVL3, Gemma3, Ovis2) and released a large agent trajectory dataset on 🤗: embodiedbench.github.io Try training and evaluating your MLLM for embodied agents!
🤖Can MLLM agents reason about spatial relationships and plan atomic actions for navigation & manipulation? 🔥 Meet EmbodiedBench 🏆—the first fine-grained benchmark for MLLM-based embodied agents! 📄 Paper: arxiv.org/abs/2502.09560 🌐 Website & code: embodiedbench.github.io
Congratulations to @Yoshua_Bengio on launching @LawZero_ — a research effort to advance safe-by-design AI, especially as frontier systems begin to exhibit signs of self-preservation and deceptive behaviour.
Thanks for sharing our work! GUI-Actor is a new GUI grounding method that combines an attention-based action head with a grounding verifier, unlike previous text-based coordinate prediction methods.
Microsoft just dropped GUI-Actor on Hugging Face Coordinate-Free Visual Grounding for GUI Agents
We figured out how to train VLAs with diffusion outputs much faster (7.5x faster), inheriting better language following from the VLM, and leading to better results. The key: protect the VLM backbone during training with knowledge insulation. Let’s talk about what we learned👇
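A minimal PyTorch toy of one way to read "protect the VLM backbone during training": stop gradients from the action head from flowing back into the VLM features. This is an illustrative assumption, not the actual knowledge-insulation recipe from the thread, and the module shapes are placeholders.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy VLA: a backbone that stands in for the VLM plus a small action head."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)             # stands in for the VLM
        self.action_head = nn.Linear(dim, action_dim)   # stands in for the diffusion action head

    def forward(self, obs_emb):
        feats = self.backbone(obs_emb)
        # "Insulation" as sketched here: the action loss sees the features,
        # but its gradient does not reach the VLM backbone.
        actions = self.action_head(feats.detach())
        return feats, actions

model = ToyVLA()
obs = torch.randn(4, 256)
feats, actions = model(obs)
loss = actions.pow(2).mean()          # placeholder action loss
loss.backward()
print(model.backbone.weight.grad)     # None: the backbone is insulated from this loss
```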
simulate python interpreter ❌ python interpreter world model ✅ 😂
Something interesting I just noticed in Hermes' cots during RL - It decided to simulate a python interpreter. The models yearn for the tools..
Impressive efficiency gains for Diffusion LLMs
🚀 Fast-dLLM: 27.6× Faster Diffusion LLMs with KV Cache & Parallel Decoding 💥 Key Features🌟 - Block-Wise KV Cache Reuses 90%+ attention activations via bidirectional caching (prefix/suffix), enabling 8.1×–27.6× throughput gains with <2% accuracy loss 🔄 -…
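A rough numpy sketch of the confidence-threshold parallel-decoding idea described above: at each step, every masked position whose predicted-token confidence clears a threshold is committed at once, so multiple tokens can be filled per step. The model call, vocabulary, and threshold are placeholders, and the toy omits the block-wise KV cache that Fast-dLLM actually relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LEN, MASK, THRESH = 50, 16, -1, 0.9

def model_probs(tokens):
    """Placeholder for a diffusion-LM forward pass: per-position distributions
    over the vocabulary (random here, just to show the decoding loop)."""
    logits = rng.normal(size=(len(tokens), VOCAB))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.full(LEN, MASK)
step = 0
while (tokens == MASK).any():
    probs = model_probs(tokens)
    conf = probs.max(axis=-1)
    pred = probs.argmax(axis=-1)
    masked = tokens == MASK
    # Parallel decoding: commit every masked position above the threshold at once;
    # if none clears it, commit the single most confident one to guarantee progress.
    accept = masked & (conf >= THRESH)
    if not accept.any():
        accept[np.flatnonzero(masked)[conf[masked].argmax()]] = True
    tokens[accept] = pred[accept]
    step += 1
print(f"decoded {LEN} tokens in {step} steps")
```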
🚀Excited to share our new LLM math reasoning work! 🔥Supervised learning (as a replacement for RL) can reach SoTA performance on LLM math reasoning! 📊
Is self-improvement exclusive to RL? Can we use supervised learning to match LLMs trained with SOTA RL algorithms? In Negative-aware Fine-Tuning (NFT), we introduce a purely supervised learning method to enhance LLMs' math reasoning with no external teachers. NFT matches or…
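To make "purely supervised learning on self-generated data, no external teacher" concrete, here is a heavily simplified sketch assuming a Hugging Face-style causal LM and tokenizer: sample solutions, verify them, and run a supervised update with a likelihood term on correct samples and a down-weighted penalty on incorrect ones. This is an illustrative assumption, not the actual NFT objective (which handles negatives via an implicit negative policy); `sample_solutions`, `is_correct`, and `neg_weight` are hypothetical.

```python
import torch

def nft_style_step(model, tokenizer, prompts, sample_solutions, is_correct,
                   optimizer, neg_weight=0.1):
    """One supervised update on self-generated data, split into positives/negatives.
    Assumes the batch yields at least one correct sample."""
    pos, neg = [], []
    for p in prompts:
        for sol in sample_solutions(model, p):             # self-generated, no teacher
            (pos if is_correct(p, sol) else neg).append(p + sol)

    def mean_nll(texts):
        if not texts:
            return torch.tensor(0.0)
        batch = tokenizer(texts, return_tensors="pt", padding=True)
        out = model(**batch, labels=batch["input_ids"])
        return out.loss                                     # token-level cross-entropy

    # Maximize likelihood of correct solutions, mildly push down incorrect ones.
    loss = mean_nll(pos) - neg_weight * mean_nll(neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```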
🚀 We've been exploring long CoT reasoning models for quite a while. Today, we're excited to share a systematic framework that redefines how to reason efficiently with LLMs: 📌 Fractured Sampling — a unified strategy for parallel thinking at inference time.
🚀 A unified strategy for parallel decoding: Fractured CoT Reasoning We explore three dims of sampling: - Reasoning trajectories - Final solutions per traj - Depth of reasoning Maximize accuracy-cost trade-off! Allocate computation for huge gains. Paper: arxiv.org/pdf/2505.12992
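To illustrate the three sampling dimensions listed above, a toy sketch of spending a fixed budget across n trajectories, m final solutions per trajectory, and several reasoning depths, then aggregating by majority vote. The `generate_solution` call and the specific split are placeholder assumptions, not the paper's allocation strategy.

```python
from collections import Counter
import random

random.seed(0)

def generate_solution(prompt, traj_id, depth):
    """Placeholder for 'continue trajectory traj_id, stop reasoning at `depth`,
    then emit a final answer'. Here it just returns a noisy answer."""
    return random.choice(["42", "42", "41"])

def fractured_sample(prompt, n_traj=4, m_solutions=2, depths=(0.5, 1.0)):
    # Three axes: which trajectory, how many final answers per (trajectory, depth),
    # and how deep into the reasoning we stop before answering.
    answers = []
    for t in range(n_traj):
        for d in depths:
            for _ in range(m_solutions):
                answers.append(generate_solution(prompt, t, d))
    # Aggregate across all n_traj * len(depths) * m_solutions samples.
    return Counter(answers).most_common(1)[0][0]

print(fractured_sample("What is 6 * 7?"))  # prints the majority answer across all samples
```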
🚨Wonder how parallel thinking works for #gemini pro 2.5? We do too! Here is our preprint exploring 3D thinking: a new training-free framework that boosts pass@k, best-of-n, and token efficiency for reasoning.