Seungju Han
@SeungjuHan3
language models & reasoning. incoming cs phd student @stanford +intern @nvidiaai. prev @allen_ai @SeoulNatlUni
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ tinyurl.com/rlshadis
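One common way to operationalize "sharpening vs. discovery" (an illustrative sketch of my own, not necessarily the blog's exact protocol) is to compare unbiased pass@k estimates of the base and RL-tuned models: if RL only lifts pass@1 while pass@k at large k stays at or below the base model's, it is mostly sharpening existing abilities; if it also lifts pass@k at large k, solving problems the base model essentially never solves, that looks like discovery.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled completions (out of n total, c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts: n samples drawn, c of them correct.
base_results = [(64, 3), (64, 0), (64, 10)]   # base model
rl_results   = [(64, 20), (64, 0), (64, 30)]  # RL-tuned model

for k in (1, 64):
    base = sum(pass_at_k(n, c, k) for n, c in base_results) / len(base_results)
    rl = sum(pass_at_k(n, c, k) for n, c in rl_results) / len(rl_results)
    print(f"pass@{k}: base={base:.3f}  rl={rl:.3f}")

# Reading: pass@1 up but pass@64 unchanged -> sharpening;
# pass@64 up on problems the base never solves -> discovery.
```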
Introducing AceReason-Nemotron 1.1 Our previous release, AceReason-Nemotron-1.0, introduced a stage-wise RL recipe that was applied sequentially to math-only and code-only prompts, demonstrating both high efficiency and strong effectiveness. Here, we systematically investigate…
(1/x) Excited to share our new work on MAPoRL🍁: Multi-Agent Post-Co-Training for Collaborative LLMs with RL. Most current approaches just prompt pre-trained models and hope they’ll work together. But can we train LLMs to discover collaboration strategies?

Gemini solved the math problems end-to-end in natural language (English). This differs from our results last year when experts first translated them into formal languages like Lean for specialized systems to tackle.
life update: I'll be starting my PhD in CS at Stanford this September! I'm very excited to continue my research on reasoning in language models and to make new friends in the Bay Area! I'm deeply grateful to everyone who supported me and made this milestone possible…
Does RL actually learn positively under random rewards when optimizing Qwen on MATH? Is Qwen really so magical that even RLing on random rewards can make it reason better? Following prior work on spurious rewards in RL, we ablated the algorithms. It turns out that if you…
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find that this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
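As a toy illustration of why "random rewards" can still move the policy (my own sketch, not the blog's analysis): with group-normalized advantages in a GRPO-style objective, random rewards still yield non-zero advantages within each group, so gradient updates happen regardless; whether they systematically help then comes down to other algorithmic details such as how ratio clipping treats tokens the model already prefers.

```python
import numpy as np

rng = np.random.default_rng(0)

# One GRPO-style group: G completions for the same prompt, random 0/1 rewards.
G = 8
rewards = rng.integers(0, 2, size=G).astype(float)

# Group-normalized advantages, as used in GRPO-style objectives.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print("random rewards:", rewards)
print("advantages:    ", adv.round(2))

# The advantages are non-zero even though the reward carries no task signal,
# so updates still occur; whether they help or hurt then hinges on heuristics
# like ratio clipping, which is the kind of detail the blog post digs into.
```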
how do people fairly evaluate agents with web access on benchmarks like HLE or GPQA? there could be content directly related to the benchmark on the web (e.g. blogpost showing an example from the benchmark), how is this issue addressed?
n-simplex attention makes a lot of sense because of its honesty: it literally says you can put more compute into the attention operation to get more gains, a trend we've seen play out many times. This differs from a lot of 'suspicious' claims, such as that you can use less compute to perform…
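For intuition, here is a minimal numpy sketch of one trilinear ("2-simplex") attention variant; the exact formulation in the work being discussed may differ. The point is the honesty about cost: scoring every *pair* of key positions per query makes attention roughly O(n³·d) instead of O(n²·d).

```python
import numpy as np

def two_simplex_attention(Q, K1, K2, V1, V2):
    """Toy trilinear attention: each query attends over *pairs* of key
    positions (j, k) instead of single positions j."""
    n, d = Q.shape
    # logits[i, j, k] = trilinear form <Q_i, K1_j, K2_k>, scaled by d
    logits = np.einsum('id,jd,kd->ijk', Q, K1, K2) / d
    # softmax over the joint (j, k) axis
    w = np.exp(logits - logits.max(axis=(1, 2), keepdims=True))
    w /= w.sum(axis=(1, 2), keepdims=True)
    # combine values from both selected positions (elementwise product here)
    return np.einsum('ijk,jd,kd->id', w, V1, V2)

n, d = 16, 8
x = np.random.randn(n, d)
out = two_simplex_attention(x, x, x, x, x)
print(out.shape)  # (16, 8); cost grew from O(n^2 d) to O(n^3 d)
```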
💡Beyond math/code, instruction following with verifiable constraints is a natural fit for RLVR. But the set of constraints and verifier functions is limited, and most models overfit on IFEval. We introduce IFBench to measure model generalization to unseen constraints.
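To make "verifiable constraints" concrete, here is a tiny IFEval-style verifier sketch (my own toy constraints, not IFBench's actual set): each constraint is a deterministic function of the response, so it can serve directly as an RLVR reward.

```python
import re

def verify_word_count(response: str, max_words: int) -> bool:
    """Constraint: response must contain at most `max_words` words."""
    return len(response.split()) <= max_words

def verify_num_bullets(response: str, n_bullets: int) -> bool:
    """Constraint: response must contain exactly `n_bullets` markdown bullets."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n_bullets

def verify_keyword(response: str, keyword: str) -> bool:
    """Constraint: response must mention `keyword`."""
    return keyword.lower() in response.lower()

def rlvr_reward(response: str) -> float:
    """Binary verifiable reward: 1.0 only if every constraint passes."""
    checks = [
        verify_word_count(response, 50),
        verify_num_bullets(response, 3),
        verify_keyword(response, "tradeoff"),
    ]
    return float(all(checks))

print(rlvr_reward("- speed\n- cost\n- the tradeoff between them"))  # 1.0
```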
🧩 Can RL learn to compose math skills into integrated solutions? ✅ Strong on isolated skills ❌ Weak on composition: even when models mastered A & B, they failed on A⊕B. RL strengthens atomic skills but struggles to teach models how to compose. 👤 Unlike humans, who…
And can more inference-time compute solve harder problems? ⚠️ It helps at moderate complexity, but gains plateau at higher levels. Due to budget constraints, we limited testing to 64 attempts, but given the zero performance, we speculate that increasing beyond this point would…
Seeing text-to-text regression work for Google’s massive compute cluster (a billion-$$ problem!) was the final result that convinced us we can reward-model literally any world feedback. Paper: arxiv.org/abs/2506.21718 Code: github.com/google-deepmin… Just train a simple encoder-decoder…
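A minimal sketch of the general idea (not the paper's actual setup; the state string, target value, and choice of t5-small are made up for illustration): serialize whatever state you want to score as text, and fine-tune a small encoder-decoder to emit the observed numeric outcome as a string.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical example: a scheduler/job state serialized as text, and the
# observed feedback (e.g. cluster efficiency) as a numeric string target.
state_text = "jobs=128 cpu_req=0.7 mem_req=0.4 priority=high queue_len=52"
target_text = "0.83"  # made-up outcome we want the model to regress to

inputs = tok(state_text, return_tensors="pt")
labels = tok(target_text, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy on the numeric string: the "regressor" is
# just an encoder-decoder trained to decode the number as tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
print(float(loss))
```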
Claude 3.5 Sonnet generated significantly better ideas for research papers than humans, but when researchers tried executing the ideas, the gap between human & AI idea quality disappeared. Execution is a harder problem for AI. (Yet this is a better outcome for AI than I expected)
Verrrrry intriguing-looking and labor-intensive test of whether LLMs can come up with good scientific ideas. After implementing those ideas, the verdict seems to be "no, not really."
Hello Gemini 2.5 Flash-Lite! So fast, it codes *each screen* on the fly (Neural OS concept 👇). The frontier isn’t always about large models and beating benchmarks. In this case, a super fast & capable model can unlock drastically new use cases. Read more: blog.google/products/gemin…
Always thought-provoking!
Q-learning is not yet scalable: seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
Is RL really scalable like other objectives? We found that just scaling up data and compute is *not* enough to enable RL to solve complex tasks. The culprit is the horizon. Paper: arxiv.org/abs/2506.04168 Thread ↓
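One way to see why the horizon is the culprit (my own toy illustration, not the paper's analysis): with bootstrapped targets, a small per-step bias in the value estimate compounds backward along the trajectory, so the error at the start state keeps growing with the horizon H (toward roughly per-step bias / (1 - γ)) even though each individual step is only slightly off.

```python
import numpy as np

def start_state_error(H: int, gamma: float = 0.99, per_step_bias: float = 0.01) -> float:
    """Toy chain MDP: propagate a fixed per-step bootstrapping bias backward
    through H steps of Q-learning-style targets, Q(s_t) ~= r + gamma * Q(s_{t+1})."""
    err = 0.0
    for _ in range(H):
        err = gamma * err + per_step_bias
    return err

for H in (10, 100, 1000):
    print(f"H={H:5d}  accumulated bias at the start state ~ {start_state_error(H):.3f}")

# The per-step bias is fixed, yet the start-state error grows with the horizon,
# which is one intuition for why scaling data and compute alone doesn't let RL
# handle much longer-horizon tasks.
```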