Hanze Dong
@hendrydong
Research Scientist @MSFTResearch. Previously @SFResearch
I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important, as we recently updated (x.com/ypwang61/statu…). But I disagree with some points in the blog regarding 1-shot RLVR. 1. For Deepseek-R1-Distill-Qwen-1.5B, we set…
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important; we also updated the format reward baseline in 1-shot RLVR (x.com/ypwang61/statu…) to report non-format gains more fairly (and we see there is still nontrivial…
We updated our analysis of format correction in 1-shot RLVR, a question we get frequently (thanks for all the feedback!). Summary: (1) Format correction indeed contributes a lot in RLVR (e.g., 18% -> 29% for Qwen2.5-Math-1.5B over 6 math tasks), in both full-set (1.2k data) and one-shot…
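To make the format-vs-correctness split above concrete, here is a minimal sketch (not the 1-shot RLVR grading code) of how a format reward can be separated from a correctness reward when scoring math outputs; `extract_boxed` and the exact-match check are hypothetical simplifications.

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} span, or None if absent.
    Hypothetical helper; real RLVR graders are more robust (nested braces, etc.)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def grade(response: str, gold: str):
    """Split the reward into a format part and a correctness part so that
    format-only gains can be reported separately from true accuracy gains."""
    answer = extract_boxed(response)
    format_reward = 1.0 if answer is not None else 0.0       # followed the \boxed{} convention
    correct_reward = 1.0 if answer == gold.strip() else 0.0  # exact-match correctness
    return format_reward, correct_reward

# Example: a response that is formatted correctly but numerically wrong
print(grade("The answer is \\boxed{42}", "41"))  # -> (1.0, 0.0)
```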
🤔
Introducing: Vibetest-use MCP - automated QA using Browser-Use 🧨 In just one line👀 ⇢ 10+ browser-use agents spin up in parallel ⇢ crawl your dev site in < 60s ⇢ flag every 404, dead button & UI glitch Runs out-of-the-box in Cursor, Claude Code & Codex (soon?). 100% Open…
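The quoted tool drives real browser agents; purely as a rough illustration of the "crawl the site in parallel and flag every 404" idea, here is a plain asyncio + aiohttp sketch. It does not use Browser-Use, MCP, or the tool's actual API, and the localhost URLs are placeholders.

```python
import asyncio
import aiohttp

async def check(session: aiohttp.ClientSession, url: str):
    """Fetch one page and report its status. Simplified: real QA agents also
    click buttons and look for UI glitches, which plain HTTP requests cannot do."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, resp.status
    except aiohttp.ClientError as exc:
        return url, f"error: {exc}"

async def crawl(urls):
    # Run all checks concurrently, loosely mimicking "10+ agents in parallel".
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, u) for u in urls))
    for url, status in results:
        if status != 200:
            print(f"FLAG {url}: {status}")

# Placeholder dev-site URLs
asyncio.run(crawl(["http://localhost:3000/", "http://localhost:3000/about"]))
```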
Excited to bring Qwen3-Coder into the browser and terminal world! Building the scaffolding and environments for this big guy to play and learn is tough but incredibly "rewarding". Agentic coding and browsing are arguably the two most important skills for digital agents: they…
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
GLM-4.1V-Thinking is awesome — delivering groundbreaking performance across benchmarks. 🚀🚀Thrilled to see our Elastic Reasoning strategy recognized and adopted!
GLM-4.1V-Thinking Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
I’m very honored to join this big project. 😆 This long-movie benchmark is a must-have evaluation set if you claim your models are good at video understanding!
🚨Meet MF²: Movie Facts & Fibs: a new benchmark for long-movie understanding! 🤔Do you think your model understands movies? Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗 — things humans recall easily, but even top models like…
Very excited to release our first reasoning model, Magistral. We released the weights of Magistral Small alongside a paper that presents our approach, online RL infrastructure, and findings.
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
Excited to share that EmbodiedBench was selected for an Oral at ICML 2025! We recently added results for new models (InternVL3, Gemma3, Ovis2) and released a large agent trajectory dataset on 🤗: embodiedbench.github.io Try training and evaluating your MLLM for embodied agents!
🤖Can MLLM agents reason about spatial relationships and plan atomic actions for navigation & manipulation? 🔥 Meet EmbodiedBench 🏆—the first fine-grained benchmark for MLLM-based embodied agents! 📄 Paper: arxiv.org/abs/2502.09560 🌐 Website & code: embodiedbench.github.io
Congratulations to @Yoshua_Bengio on launching @LawZero_ — a research effort to advance safe-by-design AI, especially as frontier systems begin to exhibit signs of self-preservation and deceptive behaviour.
Thanks for sharing our work! GUI-Actor is a new GUI grounding method that combines an attention-based action head with a grounding verifier, unlike previous text-based coordinate prediction methods.
Microsoft just dropped GUI-Actor on Hugging Face Coordinate-Free Visual Grounding for GUI Agents
We figured out how to train VLAs with diffusion outputs much faster (7.5x faster), inheriting better language following from the VLM, and leading to better results. The key: protect the VLM backbone during training with knowledge insulation. Let’s talk about what we learned👇
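A minimal PyTorch toy of one way to read "protect the VLM backbone during training": stop gradients from the action head from flowing back into the VLM features. This is an illustrative assumption, not the actual knowledge-insulation recipe from the thread, and the module shapes are placeholders.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy VLA: a backbone that stands in for the VLM plus a small action head."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)             # stands in for the VLM
        self.action_head = nn.Linear(dim, action_dim)   # stands in for the diffusion action head

    def forward(self, obs_emb):
        feats = self.backbone(obs_emb)
        # "Insulation" as sketched here: the action loss sees the features,
        # but its gradient does not reach the VLM backbone.
        actions = self.action_head(feats.detach())
        return feats, actions

model = ToyVLA()
obs = torch.randn(4, 256)
feats, actions = model(obs)
loss = actions.pow(2).mean()          # placeholder action loss
loss.backward()
print(model.backbone.weight.grad)     # None: the backbone is insulated from this loss
```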
simulate python interpreter ❌ python interpreter world model ✅ 😂
Something interesting I just noticed in Hermes' cots during RL - It decided to simulate a python interpreter. The models yearn for the tools..
Impressive efficiency gains for Diffusion LLMs
🚀 Fast-dLLM: 27.6× Faster Diffusion LLMs with KV Cache & Parallel Decoding 💥 Key Features🌟 - Block-Wise KV Cache Reuses 90%+ attention activations via bidirectional caching (prefix/suffix), enabling 8.1×–27.6× throughput gains with <2% accuracy loss 🔄 -…
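A rough numpy sketch of the confidence-threshold parallel-decoding idea described above: at each step, every masked position whose predicted-token confidence clears a threshold is committed at once, so multiple tokens can be filled per step. The model call, vocabulary, and threshold are placeholders, and the toy omits the block-wise KV cache that Fast-dLLM actually relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LEN, MASK, THRESH = 50, 16, -1, 0.9

def model_probs(tokens):
    """Placeholder for a diffusion-LM forward pass: per-position distributions
    over the vocabulary (random here, just to show the decoding loop)."""
    logits = rng.normal(size=(len(tokens), VOCAB))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.full(LEN, MASK)
step = 0
while (tokens == MASK).any():
    probs = model_probs(tokens)
    conf = probs.max(axis=-1)
    pred = probs.argmax(axis=-1)
    masked = tokens == MASK
    # Parallel decoding: commit every masked position above the threshold at once;
    # if none clears it, commit the single most confident one to guarantee progress.
    accept = masked & (conf >= THRESH)
    if not accept.any():
        accept[np.flatnonzero(masked)[conf[masked].argmax()]] = True
    tokens[accept] = pred[accept]
    step += 1
print(f"decoded {LEN} tokens in {step} steps")
```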
🚀Excited to share our new LLM math reasoning work! 🔥Supervised learning (as a replacement for RL) can reach SoTA performance on LLM math reasoning! 📊
Is self-improvement exclusive to RL? Can we use supervised learning to match LLMs trained with SOTA RL algorithms? In Negative-aware Fine-Tuning (NFT), we introduce a purely supervised learning method to enhance LLMs' math reasoning with no external teachers. NFT matches or…
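To make "purely supervised learning on self-generated data, no external teacher" concrete, here is a heavily simplified sketch assuming a Hugging Face-style causal LM and tokenizer: sample solutions, verify them, and run a supervised update with a likelihood term on correct samples and a down-weighted penalty on incorrect ones. This is an illustrative assumption, not the actual NFT objective (which handles negatives via an implicit negative policy); `sample_solutions`, `is_correct`, and `neg_weight` are hypothetical.

```python
import torch

def nft_style_step(model, tokenizer, prompts, sample_solutions, is_correct,
                   optimizer, neg_weight=0.1):
    """One supervised update on self-generated data, split into positives/negatives.
    Assumes the batch yields at least one correct sample."""
    pos, neg = [], []
    for p in prompts:
        for sol in sample_solutions(model, p):             # self-generated, no teacher
            (pos if is_correct(p, sol) else neg).append(p + sol)

    def mean_nll(texts):
        if not texts:
            return torch.tensor(0.0)
        batch = tokenizer(texts, return_tensors="pt", padding=True)
        out = model(**batch, labels=batch["input_ids"])
        return out.loss                                     # token-level cross-entropy

    # Maximize likelihood of correct solutions, mildly push down incorrect ones.
    loss = mean_nll(pos) - neg_weight * mean_nll(neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```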
🚀 We've been exploring long CoT reasoning models for quite a while. Today, we're excited to share a systematic framework that redefines how to reason efficiently with LLMs: 📌 Fractured Sampling — a unified strategy for parallel thinking at inference time.
🚀 A unified strategy for parallel decoding: Fractured CoT Reasoning We explore three dims of sampling: - Reasoning trajectories - Final solutions per traj - Depth of reasoning Maximize accuracy-cost trade-off! Allocate computation for huge gains. Paper: arxiv.org/pdf/2505.12992
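To illustrate the three sampling dimensions listed above, a toy sketch of spending a fixed budget across n trajectories, m final solutions per trajectory, and several reasoning depths, then aggregating by majority vote. The `generate_solution` call and the specific split are placeholder assumptions, not the paper's allocation strategy.

```python
from collections import Counter
import random

random.seed(0)

def generate_solution(prompt, traj_id, depth):
    """Placeholder for 'continue trajectory traj_id, stop reasoning at `depth`,
    then emit a final answer'. Here it just returns a noisy answer."""
    return random.choice(["42", "42", "41"])

def fractured_sample(prompt, n_traj=4, m_solutions=2, depths=(0.5, 1.0)):
    # Three axes: which trajectory, how many final answers per (trajectory, depth),
    # and how deep into the reasoning we stop before answering.
    answers = []
    for t in range(n_traj):
        for d in depths:
            for _ in range(m_solutions):
                answers.append(generate_solution(prompt, t, d))
    # Aggregate across all n_traj * len(depths) * m_solutions samples.
    return Counter(answers).most_common(1)[0][0]

print(fractured_sample("What is 6 * 7?"))  # prints the majority answer across all samples
```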
🚨Wonder how parallel thinking works for #gemini pro 2.5? We do too! Here is our preprint exploring 3D thinking: a new training-free framework that boosts pass@k, best-of-n, and token efficiency for reasoning.