Xiao Liang

@MasterVito0601

MEng. @Tsinghua_Uni. Incoming Ph.D. student @UCLA. Research Intern @MSFTResearch. Reasoning and RL for LLMs.

Beijing, China

Joined October 2020

525Following

146Followers

Xiao Liang Retweeted

Chujie Zheng@ChujieZheng · Jul 25

Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄 huggingface.co/papers/2507.18…

146

1.0K

715

79.0K

Xiao Liang Retweeted

Hongcheng Gao@GaoHongcheng · Jul 24

Can MLLMs truly see the world like humans? 👁️ We conduct a preliminary study using our benchmark TET 🎯, and found that nearly every SOTA (Claude-4, Gemini, o1, etc.) scores ≈ 0 % on some perception tasks that humans solve effortlessly, revealing a fundamental perceptual gap.

460

Xiao Liang Retweeted

Yining Hong@yining_hong · Jun 19

Meet Embodied Web Agents that bridge physical-digital realms. Imagine embodied agents that can search for online recipes, shop for ingredients and cook for you. Embodied web agents search internet information for implementing real-world embodied tasks. All data, codes and web…

160

46.0K

Xiao Liang Retweeted

Eric Jiang@Eric1706291394 · Jun 17

Thrilled to share our new work, EORM- Energy Outcome Reward Model!💡 Tired of complex process supervision or RLHF? 👋 We introduce EORM, a lightweight verifier that works post-hoc— a simple add-on model that ranks generated Chain-of-Thought solutions. No LLM retraining required!

3.0K

Xiao Liang@MasterVito0601 · Jun 16

We’ve released the prompts and a demo dataset of our synthetic problems—feel free to check them out! 📄 HuggingFace Paper (Begging for Upvote🥺): huggingface.co/papers/2506.08… 🧑‍💻 Github: github.com/MasterVito/SwS 🤗 Demo Dataset: huggingface.co/datasets/Maste…

XXiao Liang@MasterVito0601 · Jun 13

🙋‍♂️ Can RL training address model weaknesses without external distillation? 🚀 Please check our latest work on RL for LLM reasoning! 💯 TL;DR: We propose augmenting RL training with synthetic problems targeting model’s reasoning weaknesses. 📊Qwen2.5-32B: 42.9 → SwS-32B: 68.4

403

Xiao Liang@MasterVito0601 · May 22

Sorry, I deleted the previous post by accident, and this is a newly post one. BTW, what I want to say is: WE HAVE RELEASE THE CODE! check github.com/bigai-nlco/Lat…

HHengli Li@Hengli_Li_pku · May 22

🧐 Seek in the Dark 🤯No training🤯No data🤯No Reward Model LATENTSEEK: A novel framework that enhances LLM reasoning through Test-Time Instance-level Policy Gradient within the model’s latent space.

523

Xiao Liang@MasterVito0601 · Jun 13

Thanks Weizhu for mentoring and sharing of this work~ 🤗 The self-evolve pipeline can leverage the model’s self-consistency and instruction-following capabilities to enhance reasoning. 📜 For more details and additional settings, see our paper: arxiv.org/pdf/2506.08989

WWeizhu Chen@WeizhuChen · Jun 12

Synthesizing challenging problems that current model performs poorly is an important area in RL. Another thing interests me is the self-evolve learning via synthesizing questions/problems that the model can learn continuously. You may check our work here:mastervito.github.io/MasterVito.SwS…

534

Xiao Liang Retweeted

Zhongzhi Li@ZhongzhiLi4 · Jun 12

Hi everyone! The field of LLM-based reasoning has seen tremendous progress and rapid development over the past few months. We’ve updated our survey for exciting advances, now covering over 500 papers! 《From system 1 to system 2 a survey of reasoning large language models》

3.0K

Xiao Liang Retweeted

Zhongzhi Li@ZhongzhiLi4 · Jun 10

Our team from the Microsoft Research Asia, UCLA, Chinese Academy of Sciences, Tsinghua University, and released a paper, “TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression”proposing an innovative training method that effectively compresses the reasoning.

1.0K

Xiao Liang Retweeted

Han Jiang@SalomeJiang7 · May 10

🧐Are static benchmarks enough to assess the ethics of ever-advancing LLMs amid data leakage & saturation? In our #ICML2025 paper (arxiv.org/abs/2406.14230), we propose GETA—a generative evolving test inspired by CAT that adapts to model ability and probes LLMs' moral boundaries.

3.0K

Xiao Liang Retweeted

Dimitris Papailiopoulos@DimitrisPapail · May 1

We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.

239

1.0K

798

434.0K

Xiao Liang Retweeted

Lilian Weng@lilianweng · Dec 2

🦃 At the end of Thanksgiving holidays, I finally finished the piece on reward hacking. Not an easy one to write, phew. Reward hacking occurs when an RL agent exploits flaws in the reward function or env to maximize rewards without learning the intended behavior. This is imo a…

226

2.0K

1.0K

275.0K