Michael Luo
@michaelzluo
Project Lead @Agentica_ | Prev. Researcher @GoogleDeepMind | PhD at UC Berkeley @berkeley_ai
🚀 The era of overpriced, black-box coding assistants is OVER. Thrilled to lead the @Agentica_ team in open-sourcing and training DeepSWE—a SOTA software engineering agent trained end-to-end with @deepseek_ai-like RL on Qwen3-32B, hitting 59% on SWE-Bench Verified and topping the…
🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE…
🔮 The future is AGENTS for all applications. In the first 6 months we perfected RL for verifiable-reward reasoning—single-step chain-of-thought, deterministic answers. Now, the next years belong to multi-agent systems—multiple steps (not necessarily with explicit thought), multiple agents…
We've noticed that quite a lot of sources claim credit for one-off pipelining, which originated in our work DeepCoder. Not only SemiAnalysis @dylan522p but also bigger companies (e.g., Meta's LLAMA RL paper, see Figure 2) refuse to cite us while claiming credit.
Unreal. 🤯 Someone just pointed out to me privately yet another case of plagiarism by @dylan522p, this time from a Together.AI blog post from April. Once again, they’ve recreated an image and stamped their name on it, just like the last one they claimed was merely…
✨ NEW SWE-Agents BENCHMARK ✨
Introducing GSO: The Global Software Optimization Benchmark
- 👩🏻💻 100+ challenging software optimization tasks
- 🛣️ long-horizon tasks w/ precise specification
- 🐘 large code changes in Py, C, C++, ...
- 📉 SOTA models get < 5% success!
1/
It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results. Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%. So how did we achieve this? DeepSWE generates N candidate solutions. Then, another LLM…
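To make the distinction concrete, here is a minimal sketch with toy stand-ins (not our actual evaluation harness): Pass@K credits a problem if any of the K sampled patches passes the hidden tests, while Best@K first selects a single patch with a verifier that never sees the hidden tests and then reports whether that one patch passes, i.e., it is still a Pass@1 number.

```python
import random

def pass_at_k(candidates, resolves):
    """Pass@K: the task counts as solved if ANY of the K candidates
    resolves it (the hidden test oracle is applied to every candidate)."""
    return float(any(resolves(c) for c in candidates))

def best_at_k(candidates, verifier_score, resolves):
    """Best@K: pick ONE candidate with a verifier that never sees the hidden
    tests, then score only that single submission -- so it is still Pass@1."""
    best = max(candidates, key=verifier_score)
    return float(resolves(best))

# Toy stand-ins (illustrative only): 16 patches, one of which passes the hidden tests.
candidates = [f"patch_{i}" for i in range(16)]
resolves = lambda patch: patch == "patch_7"       # pretend hidden test suite
verifier_score = lambda patch: random.random()    # pretend LLM verifier score

print(pass_at_k(candidates, resolves))                  # 1.0: at least one sample passes
print(best_at_k(candidates, verifier_score, resolves))  # 1.0 only if the verifier picked patch_7
```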
Is it malpractice to report SOTA at pass@8 without evaluating other models at pass@8, or is it just standard practice at this point? It's clearly not SOTA if it's behind Devstral at pass@1.
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground-truth answers? Introducing Self-Rewarding Training (SRT), where language models provide their own reward for RL training! 🧵 1/n
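One minimal sketch of how a model can reward itself without ground truth, assuming the reward comes from majority voting over its own sampled answers (an illustrative instantiation; the exact SRT recipe is in the thread/paper):

```python
from collections import Counter

def self_reward(samples, extract_answer):
    """Reward each sampled completion against the model's own consensus:
    the majority-voted answer serves as a pseudo-label (no ground truth used)."""
    answers = [extract_answer(s) for s in samples]
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# Toy usage: 5 sampled solutions to one prompt, final answer extracted from each.
samples = ["... so the answer is 42", "... 42", "... 41", "... 42", "... 7"]
rewards = self_reward(samples, extract_answer=lambda s: s.split()[-1])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0]: the consensus answer "42" gets rewarded
```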
Amid all the recent excitement about RL, with lots of cool work and results, here is a reminder that RL with a reverse-KL regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify existing weak skills.
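For concreteness, a sketch of the objective being referenced (standard KL-regularized RL, my notation) and why it can only reweight what the base model already supports:

```latex
% Standard reverse-KL-regularized RL objective (my notation; a sketch of the setup referenced above):
\[
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\!\left[ r(x,y) \right]
\;-\; \beta\,\mathrm{KL}\!\left( \pi_\theta(\cdot\mid x)\,\middle\|\,\pi_{\text{base}}(\cdot\mid x) \right)
\]
% Its optimum has the closed form
\[
\pi^\star(y \mid x) \;\propto\; \pi_{\text{base}}(y \mid x)\,\exp\!\left( r(x,y)/\beta \right),
\]
% so any completion with \pi_{\text{base}}(y \mid x) = 0 keeps probability 0:
% the policy can only reweight (amplify) behaviors the base model already assigns nonzero probability to.
```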
🌏 Building web-scale agents, and tired of Math and Coding tasks? Come chat with us at ICLR in Singapore. We are presenting InSTA at the DATA-FM workshop in the second Oral session, April 28th 2:30pm. InSTA is the largest environment for training agents, spanning 150k live…
the vLLM vs SGLang beef is the weirdest (and saddest) thing ever. Both are under the Linux Foundation; they could join forces and make the best inference framework ever :/
📢 LLM and RL folks! 📢 No good RL algorithm for credit assignment in multi-turn LLM agents on reasoning-heavy tasks? Don't even have a good benchmark for studying it? In SWEET-RL, we give you both (a vibe-coding benchmark and the SWEET algorithm). A thread 🧵 (1/n)
Today we’re launching INTELLECT-2: The first decentralized 32B-parameter RL training run open to join for anyone with compute — fully permissionless. Scaling towards frontier reasoning across coding, math and science.
We're trending on @huggingface models today! 🔥 Huge thanks to our amazing community for your support. 🙏
This week @encord_team hosted AI After Hours at @github HQ and our Foundation Model Lead, Vishal Satish, shared how Ambi Robotics is leveraging 200K+ hours of high-fidelity production data to train PRIME-1—a domain-expert foundation model designed for industrial reliability.
Preprint: Can we learn to reason for story generation (~100k tokens), without reward models? Yes! We introduce an RLVR-inspired reward paradigm VR-CLI that correlates with human judgements of quality on the 'novel' task of Next-Chapter Prediction. Paper: arxiv.org/abs/2503.22828
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/
🚀 We introduce DeepCoder-14B-Preview, a fully open-sourced coding model that is on par with o3-mini and o1! We scaled our model with RL magic up to 32K context. Its performance scales to 64K context 🔥
Introducing DeepCoder-14B-Preview - our fully open-sourced reasoning model reaching o1 and o3-mini level on coding and math. The best part is, we’re releasing everything: not just the model, but the dataset, code, and training recipe—so you can train it yourself!🔥 Links below:
I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models. We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses…
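A rough sketch of the idea as described, where the recurrent hidden state is replaced by the weights of a tiny inner model that takes one gradient step on a self-supervised loss at every token; the projections, loss, and sizes here are illustrative assumptions, not the exact architecture:

```python
import torch

class TTTStyleLayer(torch.nn.Module):
    """Sketch: the "hidden state" is the weight matrix W of a tiny inner model.
    At each step, W is updated by one gradient step on a self-supervised
    reconstruction loss for the current token, then queried to produce the output."""
    def __init__(self, dim: int, lr: float = 0.1):
        super().__init__()
        self.lr = lr
        # Projections defining the inner model's self-supervised task (illustrative).
        self.proj_k = torch.nn.Linear(dim, dim, bias=False)
        self.proj_v = torch.nn.Linear(dim, dim, bias=False)
        self.proj_q = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        W = torch.zeros(B, D, D, device=x.device)   # inner model's weights = the "hidden state"
        outputs = []
        for t in range(T):
            k = self.proj_k(x[:, t])                # input to the inner model
            v = self.proj_v(x[:, t])                # self-supervised target
            pred = torch.bmm(k.unsqueeze(1), W).squeeze(1)
            err = pred - v                          # grad of 0.5*||kW - v||^2 w.r.t. W is k^T err
            W = W - self.lr * torch.bmm(k.unsqueeze(2), err.unsqueeze(1))
            q = self.proj_q(x[:, t])                # query the just-updated inner model
            outputs.append(torch.bmm(q.unsqueeze(1), W).squeeze(1))
        return torch.stack(outputs, dim=1)

# Toy usage
layer = TTTStyleLayer(dim=16)
y = layer(torch.randn(2, 8, 16))
print(y.shape)  # torch.Size([2, 8, 16])
```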
DeepSeek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face. They show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and could achieve…
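A rough sketch of the inference-time scaling part, under the assumption that the generative reward model samples several independent critique-plus-score judgments and aggregates the scores; `grm_sample` is a placeholder, not DeepSeek's API:

```python
import random

def scaled_reward(grm_sample, prompt, responses, k=8):
    """Sketch: sample k independent judgments from a generative reward model
    and aggregate the per-response scores, instead of trusting a single sample."""
    totals = {r: 0.0 for r in responses}
    for _ in range(k):
        scores = grm_sample(prompt, responses)   # one sampled judgment, e.g. {"resp_a": 7, "resp_b": 9}
        for r, s in scores.items():
            totals[r] += s
    return max(totals, key=totals.get)           # response preferred after aggregation

# Toy usage with a stand-in sampler (illustrative only).
fake_grm = lambda prompt, responses: {r: random.randint(1, 10) for r in responses}
print(scaled_reward(fake_grm, "prompt", ["resp_a", "resp_b"], k=16))
```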
Prompt-to-Leaderboard is now LIVE❤️🔥 Input any prompt → leaderboard for you in real-time. Huge shoutout to the incredible team that made this happen! @evan_a_frick @connorzchen @joseph_ten4849 @LiTianleli @infwinston @ml_angelopoulos @istoica05
Introducing Prompt-to-Leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case! P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt. The model is trained on the 2M…
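A minimal sketch of how I read the setup, assuming a prompt-conditional Bradley-Terry model: a trained network maps the prompt to one coefficient per LLM, the leaderboard is those coefficients sorted, and pairwise win probabilities come from a sigmoid of the difference (names below are placeholders):

```python
import math

def prompt_leaderboard(prompt, coef_model, model_names):
    """P2L-style sketch: a trained network maps a prompt to one Bradley-Terry
    coefficient per LLM; sorting the coefficients gives a prompt-specific leaderboard."""
    coefs = coef_model(prompt)                    # e.g., {"model_x": 1.3, "model_y": 0.2, ...}
    return sorted(model_names, key=lambda m: coefs[m], reverse=True)

def win_probability(coefs, model_a, model_b):
    """Under Bradley-Terry, P(model_a beats model_b | prompt) = sigmoid(beta_a - beta_b)."""
    return 1.0 / (1.0 + math.exp(-(coefs[model_a] - coefs[model_b])))

# Toy usage with a stand-in coefficient model (illustrative only).
fake_coef_model = lambda p: {"model_x": 1.3, "model_y": 0.2, "model_z": -0.5}
print(prompt_leaderboard("Write a SQL query...", fake_coef_model, ["model_x", "model_y", "model_z"]))
print(win_probability(fake_coef_model(""), "model_x", "model_y"))
```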