Mingyang Chen
@chen_mingyang
LLM Engineer @BaichuanAI | Ph.D. from @ZJU_China | Visiting Researcher @EdinburghUni. LLMs, Post-training, Reasoning.
🚀 Exciting news! Our open-source project ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning has just released its models, paper, and updated code! Check it out: github.com/Agent-RL/ReSea…
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
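A minimal sketch of the interleaved reason-search rollout the project describes: generation pauses at a search call, retrieved passages are spliced back into the context, and generation resumes. The tag names, the stand-in retriever, and the scripted model outputs are illustrative assumptions, not the repo's exact interface:

```python
# Hedged sketch of a ReSearch-style rollout loop. A real rollout would call
# the policy LLM with stop="</search>"; here the outputs are scripted so the
# example runs end to end.
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def retrieve(query: str) -> str:
    """Stand-in retriever; a real system would query a search index."""
    return f"[top passages for: {query}]"

# Canned model outputs (illustrative only).
_scripted = iter([
    "<think>I need the capital of France.</think><search>capital of France</search>",
    "<think>The result says Paris.</think><answer>Paris</answer>",
])

def generate(prompt: str, stop: str) -> str:
    """Stand-in for an LLM call that returns text up to and including `stop`."""
    return next(_scripted)

def rollout(question: str, max_calls: int = 4) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_calls):
        chunk = generate(context, stop="</search>")
        context += chunk
        match = SEARCH_RE.search(chunk)
        if match is None:  # no search call in this chunk: final answer reached
            break
        context += f"\n<result>{retrieve(match.group(1).strip())}</result>\n"
    return context

print(rollout("What is the capital of France?"))
```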
Many thanks to AK for sharing our work! Introducing "ReCode: Updating Code API Knowledge with Reinforcement Learning" — the RL framework that teaches models to update code API knowledge. Paper: huggingface.co/papers/2506.20… Code: github.com/zjunlp/ReCode 📚 Trained on 2K+ API…
ReCode: Updating Code API Knowledge with Reinforcement Learning
Introducing AutoMind: Adaptive Knowledgeable Agent for Automated Data Science Paper: arxiv.org/abs/2506.10974 Code (will be released soon): github.com/innovatingAI/A… Our latest work AutoMind is a new LLM agent framework that automates end-to-end machine learning pipelines by…
Impressive work on the reasoning model for AI in scientific research! Keep pushing the boundaries of LLM reasoning.
🧩 Introducing Cell-o1, a reasoning-enhanced LLM for batch-level single-cell annotation. 🧬 It takes on CellPuzzles, a new benchmark where each batch contains multiple cells with globally unique types – requiring joint reasoning over gene expression and context. 🧠 Not just…
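A tiny sketch of the batch constraint described above, assuming a prediction is valid only when it is a bijection between cells and candidate types; the function and names are illustrative, not from the Cell-o1 codebase:

```python
# Hedged reading of the CellPuzzles batch constraint: every cell in a batch
# must receive a *distinct* type drawn from the candidate set.
def is_valid_batch_annotation(predictions: list[str], candidate_types: set[str]) -> bool:
    return (
        len(predictions) == len(set(predictions))        # all assigned types distinct
        and set(predictions).issubset(candidate_types)   # all drawn from candidates
    )

# Example: a batch of 3 cells with 3 candidate types.
print(is_valid_batch_annotation(
    ["T cell", "B cell", "NK cell"], {"T cell", "B cell", "NK cell"}))  # True
print(is_valid_batch_annotation(
    ["T cell", "T cell", "NK cell"], {"T cell", "B cell", "NK cell"}))  # False
```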
We introduce Reinforcing "Cognitive Experts" – a new approach to enhance reasoning in MoE-based Large Reasoning Models (LRMs) 🌟. Thanks to Tencent's support, we had the opportunity to explore the inner workings of ultra-large models like DeepSeek-R1-671B and Qwen3-235B. By…
A conversation on the optimal reward for coding agents, infinite context models, and real-time RL
RL without ground-truth answers!
🎉 Introducing our latest work! 🚀 We propose a label-free method that enables RL without ground-truth answers, yet achieves impressive performance on mathematical tasks: 40.0% accuracy on AIME2024 🎯 with a 7B base model. Paper: huggingface.co/papers/2505.19…
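One common way to obtain a label-free reward is self-consistency: score each sampled answer by agreement with the group's majority vote, so no ground-truth label is needed. A minimal sketch of that general idea (not necessarily the exact reward used in the paper):

```python
# Hedged sketch of a majority-vote (self-consistency) reward for label-free RL.
from collections import Counter

def label_free_rewards(sampled_answers: list[str]) -> list[float]:
    # The most frequent answer in the group serves as a pseudo-label.
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if ans == majority else 0.0 for ans in sampled_answers]

# Example: 4 rollouts for one math prompt; agreement with the majority earns reward.
print(label_free_rewards(["40", "40", "42", "40"]))  # [1.0, 1.0, 0.0, 1.0]
```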
Chatting with a friend, he mentioned that he keeps looking for the "super nodes" around him, "training" himself through conversations with these high-quality interlocutors, much like training an LLM. LLM engineers can readily transfer their model-training experience to talking with people: 1️⃣ Curate the "corpus": find high-quality, well-matched conversation partners, just as an AI needs a quality dataset 2️⃣…
Is it possible to “give medicine” (vectors) to large language models to control their behavior at inference time? 🚀 Excited to share our ACL 2025 work: Steering Target Atoms (STA) — an approach to controlling LLM behavior at inference time without retraining. #ACL2025 #LLM #AI…
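For readers unfamiliar with activation steering, here is a minimal PyTorch sketch of the generic idea: add a fixed vector to one layer's output at inference time via a forward hook. STA itself targets specific atoms/components inside the model; this simplified version shifts a whole layer output and is purely illustrative:

```python
# Hedged sketch of generic activation steering at inference time.
import torch
import torch.nn as nn

hidden_dim = 16
layer = nn.Linear(hidden_dim, hidden_dim)        # stand-in for a transformer block
steering_vector = torch.randn(hidden_dim) * 0.1  # would be learned/extracted in practice

def steer(module, inputs, output):
    # Shift the layer's output along the steering direction.
    return output + steering_vector

handle = layer.register_forward_hook(steer)
x = torch.randn(2, hidden_dim)
steered = layer(x)   # hook applies the shift
handle.remove()
print(torch.allclose(steered, layer(x) + steering_vector))  # True
```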
THE WAY OF CODE, a project by @rickrubin in collaboration with Anthropic:
👏 Great work building on our open-source ReCall project! github.com/Agent-RL/ReCall
🌟 Introducing Tool-Star, a powerful RL-based multi-tool reasoner! 🚀Achieving impressive results on 10+ top reasoning benchmarks with only 3B params! 🥳 We’ve open-sourced all the code, datasets, and checkpoints Github: github.com/dongguanting/T… Paper: huggingface.co/papers/2505.16…
Giving your models more time to think before prediction, e.g. via smart decoding, chain-of-thought reasoning, latent thoughts, etc., turns out to be quite effective for unlocking the next level of intelligence. New post is here :) “Why we think”: lilianweng.github.io/posts/2025-05-…
🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL - Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data - Optimizes thoughts, scores, and judgments using GRPO - Outperforms all…
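A small sketch of how a pairwise judgment becomes verifiable: with synthetic pairs where the preferred response is known by construction, the judge's verdict can be scored exactly and used as an RL reward (e.g. inside GRPO). The "Verdict:" parsing format is an assumption, not the paper's exact protocol:

```python
# Hedged sketch of a verifiable pairwise-judge reward.
import re

def judge_reward(judge_output: str, known_better: str) -> float:
    """known_better is 'A' or 'B'; the judge is expected to end with 'Verdict: A' or 'Verdict: B'."""
    match = re.search(r"Verdict:\s*([AB])", judge_output)
    if match is None:
        return 0.0  # unparseable judgment gets no reward
    return 1.0 if match.group(1) == known_better else 0.0

print(judge_reward("...long chain of thought... Verdict: A", known_better="A"))  # 1.0
print(judge_reward("no clear verdict here", known_better="B"))                   # 0.0
```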
Nice work on efficient post-training for reasoning with search in LLMs!
RL for Search-Efficient LLMs

Presents a new post-training RL framework that explicitly trains LLMs to optimize search usage. Recipe: structured reasoning template and reward policy + GRPO. Leads to smarter and more efficient reasoning and retrieval of external knowledge.
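A minimal sketch of a reward of the kind this recipe suggests: reward answer correctness, then subtract a small cost per search call so the policy learns to retrieve only when needed. The coefficients and shaping are assumptions, not the published reward:

```python
# Hedged sketch of a search-efficiency reward for RL post-training.
def search_efficiency_reward(correct: bool, num_searches: int,
                             search_cost: float = 0.1) -> float:
    base = 1.0 if correct else 0.0
    return base - search_cost * num_searches

# A correct answer with 1 search beats a correct answer with 4 searches.
print(search_efficiency_reward(True, 1))   # 0.9
print(search_efficiency_reward(True, 4))   # 0.6
print(search_efficiency_reward(False, 0))  # 0.0
```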
🚨 New Blog Drop! 🚀 "Reflection on Knowledge Editing: Charting the Next Steps" is live! 💡 Ever wondered why knowledge editing in LLMs still feels more like a lab experiment than a real-world solution? In this post, we dive deep into where the research is thriving — and where…
isn’t it weird that we spend >99% of RL compute on the agent and <1% on the environment (eg python runtime)? who is working on scaling environment compute? and is there an optimal agent-environment compute frontier for intelligence?
Honored to contribute to a survey on Full-Stack LLM Safety 🔒🤖 A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment ArXiv: arxiv.org/abs/2504.15585 We go beyond existing works by covering the entire lifecycle of LLMs—from data to deployment…
1/N Introducing SkyRL-v0, our RL training pipeline enabling efficient RL training for long-horizon, real-environment tasks like SWE-Bench. We also open-source a series of our early trained models to showcase the potential of end-to-end online RL training on long-horizon (20-50…
Thanks to Mr. Ma for sharing the work we have been pushing forward recently, and for the very insightful take! Stay tuned!
[Reasoning, Agent] WIP paper ReCall: Learning to Reason with Tool Call for LLMs via Reinforcement Learning. ReCall is a comprehensive upgrade of ReSearch: not just search, it lets LLMs coordinate multiple external tools, an important step toward general-purpose agents. Starting point: ReSearch = "learning to search repeatedly". The core method uses structured symbols ("protocol…
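A minimal sketch of the kind of structured tool-call protocol the post describes: the model emits a tagged JSON call, a dispatcher routes it to a registered tool, and the result is fed back for the next reasoning step. The tag format, tool names, and registry are illustrative assumptions, not ReCall's exact protocol:

```python
# Hedged sketch of a multi-tool dispatcher for a ReCall-style agent loop.
import json
import re

TOOLS = {
    "search": lambda args: f"[passages for {args['query']}]",
    "python": lambda args: str(eval(args["code"])),  # toy sandbox; never eval untrusted code in practice
}

CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def dispatch(model_output: str) -> str | None:
    match = CALL_RE.search(model_output)
    if match is None:
        return None  # no tool call: the output is the final answer
    call = json.loads(match.group(1))
    return TOOLS[call["name"]](call["arguments"])

print(dispatch('<tool_call>{"name": "python", "arguments": {"code": "2**10"}}</tool_call>'))
# -> 1024
```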