❄️Andrew Zhao❄️@ICML25
@_AndrewZhao
PhD @Tsinghua_Uni. Absolute Zero, ExpeL, DiveR-CT. Research Intern @MSFTResearch, ex @BIGAI. Interested in RL, Reasoning/Safety 4 LLMs, Agents. On the job market '26
❄️Introducing Absolute Zero Reasoner: Our reasoner learns to both propose tasks that maximize learnability and improve reasoning by solving them, entirely through self-play—with no external data! Overall, it outperforms other "zero" models in math & coding domains. 🧵 1/
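The propose/solve loop above can be sketched in miniature. This is my toy illustration, not the paper's actual training setup: a proposer picks task difficulties that maximize a "learnability" reward, and a single scalar `skill` stands in for the solver model improving by practicing on the proposed tasks.

```python
# Toy sketch of a propose/solve self-play loop (illustrative only):
# the proposer targets tasks of maximal learnability, the solver
# improves by practicing on them. All quantities are hypothetical.

def solve_rate(difficulty, skill):
    # Stand-in for the solver: success probability falls as tasks
    # get harder relative to the solver's current skill.
    return min(1.0, max(0.0, 1.0 - 0.2 * (difficulty - skill)))

def learnability(rate):
    # Tasks solved always (rate=1) or never (rate=0) teach nothing;
    # the reward peaks at intermediate success rates.
    return rate * (1.0 - rate)

skill = 1.0
for step in range(50):
    # Proposer: choose the difficulty with maximal learnability.
    d = max(range(1, 11), key=lambda d: learnability(solve_rate(d, skill)))
    # Solver: practicing on solvable-but-hard tasks raises skill.
    skill += 0.1 * solve_rate(d, skill)
```

Because the proposer tracks the solver, the chosen tasks stay near a 50% solve rate, so the solver keeps receiving useful training signal as it improves.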

🚨 New Paper Alert: The Invisible Leash: Why RLVR May Not Escape Its Origin 📄 arxiv.org/abs/2507.14843 By Fang Wu*, Weihao Xuan*, Ximing Lu, Zaid Harchaoui, Yejin Choi. Does RLVR expand reasoning capabilities—or just amplify what models already know? 🧵 Thread with key insights
Unlock the Hidden Diversity in Your Language Model. In our new paper, Intent Factored Generation (IFG), we propose an inference time method to increase the diversity of generations from LLMs. IFG leads to improvements in searching for solutions to maths and code problems. (1/6)
Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to: - Sample conceptually diverse outputs💡 - Improve performance on math and code reasoning tasks🤔 - Get more engaging conversational agents 🤖
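The two-stage IFG idea can be sketched as follows. This is my stand-in, not the paper's implementation: sample a high-level "intent" at high temperature for diversity, then realize it into text at low temperature for coherence. The `INTENTS` list, the logits, and `realize` are hypothetical placeholders for language-model calls.

```python
import math
import random

# Toy sketch of Intent Factored Generation (illustrative only):
# stage 1 samples an intent at HIGH temperature (diversity),
# stage 2 realizes it into a response at LOW temperature (coherence).

INTENTS = ["induction proof", "direct computation", "contradiction"]
INTENT_LOGITS = [2.0, 1.0, 0.5]  # hypothetical model preferences over intents

def sample(items, logits, temperature, rng):
    # Standard temperature-softmax sampling.
    weights = [math.exp(l / temperature) for l in logits]
    r = rng.random() * sum(weights)
    for item, w in zip(items, weights):
        r -= w
        if r <= 0:
            return item
    return items[-1]

def realize(prompt, intent):
    # Placeholder for low-temperature generation conditioned on the intent.
    return f"[{intent}] solution to: {prompt}"

def ifg_generate(prompt, rng, t_intent=2.0):
    intent = sample(INTENTS, INTENT_LOGITS, t_intent, rng)  # diverse stage
    return realize(prompt, intent)                          # coherent stage

rng = random.Random(0)
outputs = {ifg_generate("sum of the first n odd numbers", rng) for _ in range(20)}
```

The point of the factoring: raising temperature on the short intent step buys conceptual diversity cheaply, without paying the incoherence cost of high-temperature sampling over the whole response.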
The Invisible Leash: Why RLVR May Not Escape Its Origin "RLVR is constrained by the base model's support—unable to sample solutions with zero initial probability—and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions"…
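The support argument quoted above can be stated compactly (my paraphrase of the claim, not the paper's exact formalism):

```latex
% If the RL policy is obtained by reweighting samples from the base
% model, its support cannot grow:
\pi_{\mathrm{base}}(y \mid x) = 0
\;\Longrightarrow\;
\pi_{\theta}(y \mid x) = 0,
\qquad\text{hence}\qquad
\operatorname{supp}\bigl(\pi_{\theta}(\cdot \mid x)\bigr)
\subseteq
\operatorname{supp}\bigl(\pi_{\mathrm{base}}(\cdot \mid x)\bigr).
```

A sequence with zero initial probability is never sampled, so it never receives reinforcement signal; RLVR can only redistribute probability mass over what the base model could already produce.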
Anthropic just released a research paper. Inverse Scaling in Test-Time Compute This study shows that longer reasoning in Large Reasoning Models (LRMs) can hurt performance—revealing a surprising inverse scaling between reasoning length and accuracy. According to this paper,…
Officially validated IMO gold medal, purely via search in token space, achieved in 4.5 hrs (unclear at what compute cost). The solutions read nicely as well deepmind.google/discover/blog/…
Excited to share that a scaled up version of Gemini DeepThink achieves gold-medal standard at the International Mathematical Olympiad. This result is official, and certified by the IMO organizers. Watch out this space, more to come soon! deepmind.google/discover/blog/…
Dale talks about Large Language Models and Computation at the Programmatic Representations for Agent Learning workshop at #ICML2025!
Our #ICML2025 Programmatic Representations for Agent Learning workshop will take place tomorrow, July 18th, at the West Meeting Room 301-305, exploring how programmatic representations can make agent learning more interpretable, generalizable, efficient, and safe! Come join us!
adding post training datasets will not lead to techno-economic machine runaway
Congratulations to Sergey Ioffe & Christian Szegedy, authors of "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", recipients of the #ICML2025 Test-of-Time Award! arxiv.org/abs/1502.03167
I will present this work at the ICML Multi-Agent System (MAS) workshop during the poster sessions. If you are interested in this work or self-play LLMs in general, please feel free to come chat with me!
🤔Conventional LM safety alignment is reactive: find vulnerabilities→patch→repeat 🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵
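The attacker/defender co-evolution above can be caricatured in a few lines. This is my illustration, not the paper's algorithm: tabular "policies" stand in for RL-trained LLMs, the attacker's sampling preferences shift toward prompts that still succeed, and the defender is patched on each attack that gets through.

```python
import random

# Toy sketch of attacker/defender self-play (illustrative only):
# attacker reinforces attacks that work; defender patches holes
# that were just exploited. Both agents are hypothetical stand-ins.

rng = random.Random(0)
ATTACKS = [f"attack_{i}" for i in range(5)]
refuse_prob = {a: 0.2 for a in ATTACKS}   # defender: initially weak
attack_pref = {a: 1.0 for a in ATTACKS}   # attacker: initially uniform

def sample_attack():
    # Sample an attack proportionally to the attacker's preferences.
    r = rng.random() * sum(attack_pref.values())
    for a, w in attack_pref.items():
        r -= w
        if r <= 0:
            return a
    return ATTACKS[-1]

for step in range(200):
    a = sample_attack()
    succeeded = rng.random() > refuse_prob[a]
    # Attacker update: make successful attacks more likely.
    attack_pref[a] *= 1.1 if succeeded else 0.95
    # Defender update: patch the attack that just got through.
    if succeeded:
        refuse_prob[a] = min(0.99, refuse_prob[a] + 0.05)

avg_defense = sum(refuse_prob.values()) / len(ATTACKS)
```

The online, adversarial loop is the point: the attacker keeps moving to whatever still works, so the defender is trained against a shifting distribution of attacks rather than a fixed red-team dataset.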
If you want one takeaway from ICML: there’s gonna be 100 papers on exploration for reasoning in the coming month(s)
Most AI benchmarks test the past. But real intelligence is about predicting the future. Introducing FutureBench — a new benchmark for evaluating agents on real forecasting tasks that we developed with @huggingface 🔍 Reasoning > memorization 📊 Real-world events 🧠 Dynamic,…
Something to watch out for when evaluating tool-using agents: they can "cheat" by browsing the web and simply looking up the answer key. The @OpenAI ChatGPT Agent team had to take special care to mitigate this risk.
ChatGPT agent’s capabilities are reflected in its state-of-the-art performance on academic and real-world task evaluations, like data modeling, spreadsheet editing, and investment banking.
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find out this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
Thrilled that our paper was selected for the #ICML2025 AI4Math Best Paper Award! 🎉 Sadly, I can’t attend in person due to visa issues, but Andrew will present on behalf of our team. 🎤 Don’t miss his talk—check it out! 13:45–14:00, July 18. Ballroom C, West Building.
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
Releasing SYNTHETIC-2: our open dataset of 4m verified reasoning traces spanning a comprehensive set of complex RL tasks and verifiers. Created by hundreds of compute contributors across the globe via our pipeline parallel decentralized inference stack. primeintellect.ai/blog/synthetic…
new blog: How to scale RL to 10^26 FLOPs. everyone is trying to figure out the right way to scale reasoning with RL. ilya compared the Internet to fossil fuel: it may be the only useful data we have, and it's exhaustible. perhaps we should learn to reason from The Internet (not…