Xiao Liang
@MasterVito0601
MEng. @Tsinghua_Uni. Incoming Ph.D. student @UCLA. Research Intern @MSFTResearch. Reasoning and RL for LLMs.
Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄 huggingface.co/papers/2507.18…
Can MLLMs truly see the world like humans? 👁️ We conduct a preliminary study using our benchmark TET 🎯, and found that nearly every SOTA (Claude-4, Gemini, o1, etc.) scores ≈ 0 % on some perception tasks that humans solve effortlessly, revealing a fundamental perceptual gap.
Meet Embodied Web Agents that bridge physical-digital realms. Imagine embodied agents that can search for online recipes, shop for ingredients and cook for you. Embodied web agents search internet information for implementing real-world embodied tasks. All data, codes and web…
Thrilled to share our new work, EORM- Energy Outcome Reward Model!💡 Tired of complex process supervision or RLHF? 👋 We introduce EORM, a lightweight verifier that works post-hoc— a simple add-on model that ranks generated Chain-of-Thought solutions. No LLM retraining required!
We’ve released the prompts and a demo dataset of our synthetic problems—feel free to check them out! 📄 HuggingFace Paper (Begging for Upvote🥺): huggingface.co/papers/2506.08… 🧑💻 Github: github.com/MasterVito/SwS 🤗 Demo Dataset: huggingface.co/datasets/Maste…
🙋♂️ Can RL training address model weaknesses without external distillation? 🚀 Please check our latest work on RL for LLM reasoning! 💯 TL;DR: We propose augmenting RL training with synthetic problems targeting model’s reasoning weaknesses. 📊Qwen2.5-32B: 42.9 → SwS-32B: 68.4
Sorry, I deleted the previous post by accident, and this is a newly post one. BTW, what I want to say is: WE HAVE RELEASE THE CODE! check github.com/bigai-nlco/Lat…
🧐 Seek in the Dark 🤯No training🤯No data🤯No Reward Model LATENTSEEK: A novel framework that enhances LLM reasoning through Test-Time Instance-level Policy Gradient within the model’s latent space.
Thanks Weizhu for mentoring and sharing of this work~ 🤗 The self-evolve pipeline can leverage the model’s self-consistency and instruction-following capabilities to enhance reasoning. 📜 For more details and additional settings, see our paper: arxiv.org/pdf/2506.08989
Synthesizing challenging problems that current model performs poorly is an important area in RL. Another thing interests me is the self-evolve learning via synthesizing questions/problems that the model can learn continuously. You may check our work here:mastervito.github.io/MasterVito.SwS…
Hi everyone! The field of LLM-based reasoning has seen tremendous progress and rapid development over the past few months. We’ve updated our survey for exciting advances, now covering over 500 papers! 《From system 1 to system 2 a survey of reasoning large language models》
Our team from the Microsoft Research Asia, UCLA, Chinese Academy of Sciences, Tsinghua University, and released a paper, “TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression”proposing an innovative training method that effectively compresses the reasoning.
🧐Are static benchmarks enough to assess the ethics of ever-advancing LLMs amid data leakage & saturation? In our #ICML2025 paper (arxiv.org/abs/2406.14230), we propose GETA—a generative evolving test inspired by CAT that adapts to model ability and probes LLMs' moral boundaries.
We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.
🦃 At the end of Thanksgiving holidays, I finally finished the piece on reward hacking. Not an easy one to write, phew. Reward hacking occurs when an RL agent exploits flaws in the reward function or env to maximize rewards without learning the intended behavior. This is imo a…