Kaixuan Huang
@KaixuanHuang1
AGI strategist. PhD student @Princeton; Google PhD Fellowship 2024; ex-intern @GoogleDeepMind; undergrad @PKU1898. Opinions my own.
Do LLMs have true generalizable mathematical reasoning capability or are they merely memorizing problem-solving skills? 🤨 We present MATH-Perturb, modified level-5 problems from MATH dataset to benchmark LLMs' generalizability to slightly perturbed problems. 🔗…
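(Not the released MATH-Perturb harness, just a minimal sketch of the evaluation logic the post implies: score a model on the original and the perturbed version of each problem and look at the accuracy drop. The record fields and the `query_model` / `is_correct` callables are hypothetical placeholders.)
```python
def accuracy(problems, answers, query_model, is_correct):
    """Fraction of problems the model answers correctly."""
    hits = sum(is_correct(query_model(p), a) for p, a in zip(problems, answers))
    return hits / len(problems)

def generalization_gap(pairs, query_model, is_correct):
    """Accuracy drop from original to perturbed problems; a large drop
    suggests memorized solution templates rather than reasoning that
    transfers to slightly modified problems."""
    acc_orig = accuracy([p["original_problem"] for p in pairs],
                        [p["original_answer"] for p in pairs],
                        query_model, is_correct)
    acc_pert = accuracy([p["perturbed_problem"] for p in pairs],
                        [p["perturbed_answer"] for p in pairs],
                        query_model, is_correct)
    return acc_orig, acc_pert, acc_orig - acc_pert
```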

Just returned from ICML 2025 where I had the honor of keynoting three remarkable workshops. Grateful for the opportunity to delve into topics like self-evolving Alita agents, CRISPR-GPT for AI-driven science, Genome-Bench, reinforcement-learning agents, and AI biosafety. Special…
💔 2nd & 3rd deaths linked to Sarepta gene therapy: trial pause, stock drop. We must accelerate safer gene & cell cures. AI design & AI agents + real-world validation can help! 🚀 AI momentum: SynBioBeta’s “Towards an AI-Driven CRISPR Future” (synbiobeta.com/read/towards-a…) charts…
I'd like to see Meta building a lean LLM team around Narang, Allen-Zhu, Mike Lewis, Zettlemoyer and Sukhbaatar and giving them all the budget and power.
Given the sheer number of ppl interested in PG methods nowadays, I'm sure innocent "rediscoveries" like this are happening every day. OTOH, due diligence takes minimal effort today as you can just DeepResearch. All it takes is the sense/taste to ask "no way this is not done b4"...
I read this paper in detail, and I am very sad! They literally re-do the optimal reward baseline work that we have known since forever, without even crediting the true authors in their derivations. The third screenshot is taken from: ieeexplore.ieee.org/stamp/stamp.js… As you see, they…
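(For context, the "optimal reward baseline" in question is the classic variance-reduction result for REINFORCE: the constant baseline that minimizes the variance of the gradient estimate is the return average weighted by the squared norm of the score function, a result that appears in, e.g., Weaver & Tao (2001) and Greensmith et al. (2004). Below is a minimal numpy sketch of that textbook formula, not of the paper under discussion.)
```python
import numpy as np

def optimal_constant_baseline(grad_log_probs, returns):
    """Variance-minimizing constant baseline for the REINFORCE estimator
    g = grad_log_pi(a|s) * (R - b).

    Classic result: b* = E[||grad_log_pi||^2 * R] / E[||grad_log_pi||^2],
    i.e. a return average weighted by the squared score-function norm
    (rather than the plain mean return).

    grad_log_probs: array of shape (N, d), per-sample score vectors
    returns:        array of shape (N,),   per-sample returns
    """
    sq_norms = np.sum(grad_log_probs ** 2, axis=1)   # ||grad log pi||^2 per sample
    return np.sum(sq_norms * returns) / np.sum(sq_norms)

def reinforce_gradient(grad_log_probs, returns, baseline):
    """Baseline-subtracted policy-gradient estimate (mean over samples)."""
    return np.mean(grad_log_probs * (returns - baseline)[:, None], axis=0)
```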
Glad to see that CRISPR-GPT inspired wonderful work on general-purpose biomedical agents 😀 Congrats on the release of Biomni!
📢 Introducing Biomni - the first general-purpose biomedical AI agent. Biomni is built on the first unified environment for biomedical agents, with 150 tools, 59 databases, and 106 software packages, and a generalist agent design with retrieval, planning, and code as action. This…
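(A generic illustration of the retrieval + planning + code-as-action loop described above, not Biomni's actual implementation; `llm`, `retrieve`, and `execute` are hypothetical callables.)
```python
def code_as_action_agent(task, llm, retrieve, execute, max_steps=10):
    """Minimal code-as-action agent loop.
    llm(prompt) -> str        proposes the next step: code, or a final answer
    retrieve(query) -> str    returns relevant tool/database documentation
    execute(code) -> str      runs the code in the environment, returns output
    """
    context = retrieve(task)                        # ground the agent in relevant tools/docs
    history = [f"Task: {task}", f"Context: {context}"]
    for _ in range(max_steps):
        step = llm("\n".join(history) + "\nNext action (python code or FINAL: answer):")
        if step.strip().startswith("FINAL:"):       # agent decides it is done
            return step.strip()[len("FINAL:"):].strip()
        observation = execute(step)                 # code is the action; run it, observe
        history.append(f"Action:\n{step}\nObservation:\n{observation}")
    return "No answer within step budget."
```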
The next ~1-4 years will be about taking the 2017-2020 era of Deep RL and scaling it up: exploration, generalization, long-horizon tasks, credit assignment, continual learning, multi-agent interaction! Lots of cool work to be done! 🎮🤖 But we shouldn't forget big lessons from back…
"Speeding up LLMs using discrete diffusion models", now by Gemini Diffusion. Whoever has access, please tell me whether the model only supports deterministic generations --- outputting the same response every time it's given the same input.
I just found my old email written on Dec 4, 2023, where I talked about the three research directions I am most excited about. After 1.5 years: (1) was done by o1, QwQ, Deepseek-R1, etc. (2) is being explored in @InceptionAILabs. It seems (3) is still ongoing and hasn't been…
2025 is the year of benchmarks and agents. 2026 will be the year of the unified world model.
Evaluations are essential to understanding how models perform in health settings. HealthBench is a new evaluation benchmark, developed with input from 250+ physicians from around the world, now available in our GitHub repository. openai.com/index/healthbe…
Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉 Li joined @Princeton in 1986 and has made important contributions to several research areas in computer science. bit.ly/3RPLxas
Thrilled to know that our paper, `Safety Alignment Should be Made More Than Just a Few Tokens Deep`, received the ICLR 2025 Outstanding Paper Award. We sincerely thank the ICLR committee for awarding one of this year's Outstanding Paper Awards to AI Safety / Adversarial ML.…
Outstanding Papers: Safety Alignment Should be Made More Than Just a Few Tokens Deep (Xiangyu Qi, et al.); Learning Dynamics of LLM Finetuning (Yi Ren and Danica J. Sutherland); AlphaEdit: Null-Space Constrained Model Editing for Language Models (Junfeng Fang, et al.).
When I tested the performance of o3-mini on MATH-Perturb, I found that it performed significantly worse than o1-mini. After inspecting the raw outputs, I discovered that o3-mini used a lot of Unicode characters, and my previous parser failed to process them. So I hand-crafted a…
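(A hedged sketch of the kind of Unicode normalization such a parser needs; the symbol table and helpers below are illustrative, not the parser actually used for MATH-Perturb.)
```python
import re
import unicodedata

# Illustrative mapping from Unicode math symbols that reasoning models often emit
# to the ASCII/LaTeX-style forms a simple answer parser expects.
UNICODE_TO_ASCII = {
    "−": "-",      # U+2212 minus sign
    "×": "*",      # multiplication sign
    "÷": "/",      # division sign
    "√": "sqrt",   # square root
    "π": "pi",
    "∞": "infinity",
}

def normalize_answer(text: str) -> str:
    """Normalize a model's raw output before extracting the final answer."""
    text = unicodedata.normalize("NFKC", text)   # fold full-width digits, superscripts, etc.
    for uni, ascii_ in UNICODE_TO_ASCII.items():
        text = text.replace(uni, ascii_)
    return text

def extract_final_answer(text: str):
    """Grab the last \\boxed{...} expression, a common MATH answer convention.
    (Simple pattern; does not handle nested braces.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", normalize_answer(text))
    return matches[-1] if matches else None
```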

Life update: Following my recent graduation, I've joined the Bytedance Seed Edge team to pursue this research direction further. Although this post was written last year, my conviction in this approach has only strengthened (many ideas here echo compelling recent writings from…
x.com/i/article/1848…