Seungju Han
@SeungjuHan3
language models & reasoning. incoming cs phd student @stanford +intern @nvidiaai. prev @allen_ai @SeoulNatlUni
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ tinyurl.com/rlshadis
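One common way to operationalize "sharpening vs. discovery" (an illustrative sketch of my own, not necessarily the blog's exact protocol) is to compare unbiased pass@k estimates of the base and RL-tuned models: if RL only lifts pass@1 while pass@k at large k stays at or below the base model's, it is mostly sharpening existing abilities; if it also lifts pass@k at large k, solving problems the base model essentially never solves, that looks like discovery.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled completions (out of n total, c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts: n samples drawn, c of them correct.
base_results = [(64, 3), (64, 0), (64, 10)]   # base model
rl_results   = [(64, 20), (64, 0), (64, 30)]  # RL-tuned model

for k in (1, 64):
    base = sum(pass_at_k(n, c, k) for n, c in base_results) / len(base_results)
    rl = sum(pass_at_k(n, c, k) for n, c in rl_results) / len(rl_results)
    print(f"pass@{k}: base={base:.3f}  rl={rl:.3f}")

# Reading: pass@1 up but pass@64 unchanged -> sharpening;
# pass@64 up on problems the base never solves -> discovery.
```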
Introducing AceReason-Nemotron 1.1 Our previous release, AceReason-Nemotron-1.0, introduced a stage-wise RL recipe that was applied sequentially to math-only and code-only prompts, demonstrating both high efficiency and strong effectiveness. Here, we systematically investigate…
(1/x) Excited to share our new work on MAPoRL🍁: Multi-Agent Post-Co-Training for Collaborative LLMs with RL. Most current approaches just prompt pre-trained models and hope they’ll work together. But can we train LLMs to discover collaboration strategies?

Gemini solved the math problems end-to-end in natural language (English). This differs from our results last year when experts first translated them into formal languages like Lean for specialized systems to tackle.
life update: I'll be starting my PhD in CS at Stanford this September! I'm very excited to continue my research on reasoning in language models and to make new friends in the Bay Area! I'm deeply grateful to everyone who supported me and made this milestone possible…
Does RL actually learn positively under random rewards when optimizing Qwen on MATH? Is Qwen really so magical that even RLing on random rewards can make it reason better? Following prior work on spurious rewards in RL, we ablated the algorithms. It turns out that if you…
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find that this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
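As a toy illustration of why "random rewards" can still move the policy (my own sketch, not the blog's analysis): with group-normalized advantages in a GRPO-style objective, random rewards still yield non-zero advantages within each group, so gradient updates happen regardless; whether they systematically help then comes down to other algorithmic details such as how ratio clipping treats tokens the model already prefers.

```python
import numpy as np

rng = np.random.default_rng(0)

# One GRPO-style group: G completions for the same prompt, random 0/1 rewards.
G = 8
rewards = rng.integers(0, 2, size=G).astype(float)

# Group-normalized advantages, as used in GRPO-style objectives.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print("random rewards:", rewards)
print("advantages:    ", adv.round(2))

# The advantages are non-zero even though the reward carries no task signal,
# so updates still occur; whether they help or hurt then hinges on heuristics
# like ratio clipping, which is the kind of detail the blog post digs into.
```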
how do people fairly evaluate agents with web access on benchmarks like HLE or GPQA? there could be content directly related to the benchmark on the web (e.g. blogpost showing an example from the benchmark), how is this issue addressed?
n-simplex attention makes a lot of sense because of its honesty: it literally says you can put more compute into the attention operation to get more gains, a trend we've seen play out many times. This differs from a lot of 'suspicious' claims, such as that you can use less compute to perform…
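For intuition, here is a minimal numpy sketch of one trilinear ("2-simplex") attention variant; the exact formulation in the work being discussed may differ. The point is the honesty about cost: scoring every *pair* of key positions per query makes attention roughly O(n³·d) instead of O(n²·d).

```python
import numpy as np

def two_simplex_attention(Q, K1, K2, V1, V2):
    """Toy trilinear attention: each query attends over *pairs* of key
    positions (j, k) instead of single positions j."""
    n, d = Q.shape
    # logits[i, j, k] = trilinear form <Q_i, K1_j, K2_k>, scaled by d
    logits = np.einsum('id,jd,kd->ijk', Q, K1, K2) / d
    # softmax over the joint (j, k) axis
    w = np.exp(logits - logits.max(axis=(1, 2), keepdims=True))
    w /= w.sum(axis=(1, 2), keepdims=True)
    # combine values from both selected positions (elementwise product here)
    return np.einsum('ijk,jd,kd->id', w, V1, V2)

n, d = 16, 8
x = np.random.randn(n, d)
out = two_simplex_attention(x, x, x, x, x)
print(out.shape)  # (16, 8); cost grew from O(n^2 d) to O(n^3 d)
```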
💡Beyond math/code, instruction following with verifiable constraints is a natural fit for RLVR. But the set of constraints and verifier functions is limited, and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.
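To make "verifiable constraints" concrete, here is a tiny IFEval-style verifier sketch (my own toy constraints, not IFBench's actual set): each constraint is a deterministic function of the response, so it can serve directly as an RLVR reward.

```python
import re

def verify_word_count(response: str, max_words: int) -> bool:
    """Constraint: response must contain at most `max_words` words."""
    return len(response.split()) <= max_words

def verify_num_bullets(response: str, n_bullets: int) -> bool:
    """Constraint: response must contain exactly `n_bullets` markdown bullets."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n_bullets

def verify_keyword(response: str, keyword: str) -> bool:
    """Constraint: response must mention `keyword`."""
    return keyword.lower() in response.lower()

def rlvr_reward(response: str) -> float:
    """Binary verifiable reward: 1.0 only if every constraint passes."""
    checks = [
        verify_word_count(response, 50),
        verify_num_bullets(response, 3),
        verify_keyword(response, "tradeoff"),
    ]
    return float(all(checks))

print(rlvr_reward("- speed\n- cost\n- the tradeoff between them"))  # 1.0
```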
🧩 Can RL learn to compose math skills into integrated solutions? ✅ Strong on isolated skills ❌ Weak on composition: even when models mastered A & B, they failed on A⊕B. RL strengthens atomic skills but struggles to teach models how to compose. 👤 Unlike humans, who…
And can more inference-time compute solve harder problems? ⚠️ It helps at moderate complexity, but gains plateau at higher levels. Due to budget constraints, we limited testing to 64 attempts, but given the zero performance, we speculate that increasing beyond this point would…
Seeing text-to-text regression work for Google’s massive compute cluster (a billion-$$ problem!) was the final result that convinced us we can reward-model literally any world feedback. Paper: arxiv.org/abs/2506.21718 Code: github.com/google-deepmin… Just train a simple encoder-decoder…
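A minimal sketch of the general idea (not the paper's actual setup; the state string, target value, and choice of t5-small are made up for illustration): serialize whatever state you want to score as text, and fine-tune a small encoder-decoder to emit the observed numeric outcome as a string.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical example: a scheduler/job state serialized as text, and the
# observed feedback (e.g. cluster efficiency) as a numeric string target.
state_text = "jobs=128 cpu_req=0.7 mem_req=0.4 priority=high queue_len=52"
target_text = "0.83"  # made-up outcome we want the model to regress to

inputs = tok(state_text, return_tensors="pt")
labels = tok(target_text, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy on the numeric string: the "regressor" is
# just an encoder-decoder trained to decode the number as tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
print(float(loss))
```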
Claude 3.5 Sonnet generated significantly better ideas for research papers than humans, but when researchers tried executing the ideas, the gap between human & AI idea quality disappeared. Execution is a harder problem for AI. (Yet this is a better outcome for AI than I expected)
Verrrrry intriguing-looking and labor-intensive test of whether LLMs can come up with good scientific ideas. After implementing those ideas, the verdict seems to be "no, not really."
Hello Gemini 2.5 Flash-Lite! So fast, it codes *each screen* on the fly (Neural OS concept 👇). The frontier isn’t always about large models and beating benchmarks. In this case, a super fast & capable model can unlock drastically new use cases. Read more: blog.google/products/gemin…
Always thought-provoking!
Q-learning is not yet scalable: seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓
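One way to see why the horizon is the culprit (my own toy illustration, not the paper's analysis): with bootstrapped targets, a small per-step bias in the value estimate compounds backward along the trajectory, so the error at the start state keeps growing with the horizon H (toward roughly per-step bias / (1 - γ)) even though each individual step is only slightly off.

```python
import numpy as np

def start_state_error(H: int, gamma: float = 0.99, per_step_bias: float = 0.01) -> float:
    """Toy chain MDP: propagate a fixed per-step bootstrapping bias backward
    through H steps of Q-learning-style targets, Q(s_t) ~= r + gamma * Q(s_{t+1})."""
    err = 0.0
    for _ in range(H):
        err = gamma * err + per_step_bias
    return err

for H in (10, 100, 1000):
    print(f"H={H:5d}  accumulated bias at the start state ~ {start_state_error(H):.3f}")

# The per-step bias is fixed, yet the start-state error grows with the horizon,
# which is one intuition for why scaling data and compute alone doesn't let RL
# handle much longer-horizon tasks.
```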