Seungone Kim
@seungonekim
Ph.D. student @LTIatCMU and intern at @AIatMeta working on (V)LM Evaluation & Systems that Self-Improve | Prev: @kaist_ai @yonsei_u
#NLProc New paper on "evaluation-time scaling", a new dimension for leveraging test-time compute! We replicate the test-time scaling behaviors observed in generators (e.g., o1, r1, s1) with evaluators by forcing them to generate additional reasoning tokens. arxiv.org/abs/2503.19877
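A minimal sketch of what forcing extra evaluator reasoning could look like, assuming a hypothetical `generate` completion function (prompt in, text out); this mimics s1-style budget forcing applied to a judge, not the paper's actual implementation:

```python
from typing import Callable

def evaluate_with_budget(
    generate: Callable[[str], str],  # hypothetical LLM completion function
    question: str,
    answer: str,
    min_reasoning_tokens: int = 512,
) -> str:
    prompt = (
        "You are a strict evaluator. Reason step by step, then output "
        "'Score: <1-5>'.\n"
        f"Question: {question}\nAnswer: {answer}\nReasoning:"
    )
    reasoning = ""
    # Budget forcing: whenever the judge tries to stop early, strip its
    # verdict and append a continuation cue so it must keep reasoning.
    while len(reasoning.split()) < min_reasoning_tokens:
        chunk = generate(prompt + reasoning)
        reasoning += chunk
        if "Score:" in reasoning and len(reasoning.split()) < min_reasoning_tokens:
            reasoning = reasoning.rsplit("Score:", 1)[0] + "\nWait,"
    if "Score:" not in reasoning:
        reasoning += generate(prompt + reasoning)  # final verdict pass
    return reasoning
```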

I’ll be presenting our LLM-as-an-Interviewer work at #ACL2025! 📅 When: July 30 (Wed) 11:00-12:30 📍 Where: Hall 4/5 arxiv.org/abs/2412.10424 Feel free to stop by! Looking forward to discussing (m)LLM evaluation and more!
[1/7] 🚨 New LLM Evaluation Paper Alert! How can we better understand LLMs' abilities? Why not interview them across multiple turns? 🎤 We introduce the LLM-as-an-Interviewer Framework, along with its summarized interview report! 👉 arxiv.org/abs/2412.10424
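For intuition, here is a toy version of a multi-turn interview loop with a closing report; `interviewer` and `candidate` are hypothetical chat functions mapping a transcript to a string, not the framework's real API:

```python
def interview(interviewer, candidate, seed_question, rounds=3):
    # Toy loop: the interviewer asks, adapts follow-ups to the candidate's
    # answers across turns, then writes a summary report.
    transcript, question = [], seed_question
    for _ in range(rounds):
        answer = candidate(transcript + [("interviewer", question)])
        transcript += [("interviewer", question), ("candidate", answer)]
        question = interviewer(transcript + [("system", "Ask a follow-up question.")])
    report = interviewer(transcript + [("system", "Summarize strengths and weaknesses.")])
    return transcript, report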
Can LLMs self-improve on code generation? Check out our work AlphaVerus, where the model generates provably correct code and self-improves without any weight updates! At #ICML2025 today: 📆: 11:00 AM - 1:30 PM 📍: Poster #East-2912 alphaverus.github.io w/ Bryan, @wellecks
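A rough sketch of the generate-verify-refine idea, where the "self-improvement" lives in the prompt rather than the weights; `generate` and `verify` are hypothetical stand-ins (AlphaVerus itself targets the Verus verifier):

```python
def verified_self_improvement(generate, verify, spec, exemplars, max_iters=5):
    # Propose code, check it with a formal verifier, and feed verified
    # solutions back into the prompt as exemplars: no weight updates needed.
    feedback = ""
    for _ in range(max_iters):
        code = generate(spec, exemplars, feedback)   # condition on past wins
        ok, feedback = verify(spec, code)            # proof accepted or errors
        if ok:
            exemplars.append((spec, code))           # grow the verified pool
            return code
    return None  # no verified solution within budget
```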
Some updates 🚨 I finished my Ph.D. at @uwcse in June 2025! After a year at AI2 as a Research Scientist, I am joining CMU @LTIatCMU & @mldcmu (courtesy) as an Assistant Professor in Fall 2026. The journey, acknowledgments & recruiting in 🧵
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
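As a toy illustration (not the actual H-Net routing module), a dynamic chunker might flag a boundary wherever adjacent byte representations disagree, so chunk edges fall on natural units like word boundaries:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyChunker(nn.Module):
    """Illustration only: mark a chunk boundary where neighboring
    byte-level states have low similarity."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (seq, dim)
        # Low similarity between neighbors suggests a natural unit edge.
        sim = F.cosine_similarity(self.q(h[1:]), self.k(h[:-1]), dim=-1)
        p_boundary = (1.0 - sim) / 2.0  # map cosine [-1, 1] -> prob [1, 0]
        bounds = torch.cat([torch.ones(1), (p_boundary > 0.5).float()])
        return bounds  # 1 = start of a new chunk for the inner network

print(ToyChunker(64)(torch.randn(16, 64)))
```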
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
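An illustrative skeleton of such a self-play loop, assuming hypothetical `policy` and `judge` callables, with the game's programmatic verifier supplying reward for free:

```python
def play_episode(policy, judge, max_turns=10):
    # One self-play game: the same policy plays both sides; `judge` is a
    # cheap programmatic verifier returning the winning side (0 or 1).
    state, moves = [], []
    for turn in range(max_turns):
        move = policy(state)
        moves.append((turn % 2, move))
        state.append(move)
    return judge(state), moves

def collect_self_play_data(policy, judge, episodes=1000):
    data = []
    for _ in range(episodes):
        winner, moves = play_episode(policy, judge)
        # Verifiable reward: +1 for the winner's moves, -1 otherwise.
        data += [(move, 1 if side == winner else -1) for side, move in moves]
    return data  # would then feed an RL update such as GRPO/PPO
```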
New preprint on web agents🚨 Go-Browse: Training Web Agents with Structured Exploration Problem: LLMs lack prior understanding of the websites that web agents will be deployed on. Solution: Go-Browse is an unsupervised method for automatically collecting diverse and realistic…
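A hypothetical sketch of the structured-exploration idea: treat discovered pages as a frontier, propose tasks at each page, and keep only verified trajectories. All callables here are assumptions for illustration, not the paper's API:

```python
from collections import deque

def structured_exploration(start_url, discover_links, propose_tasks,
                           attempt, verify, max_pages=50):
    # BFS over the site: each visited page yields candidate tasks; only
    # trajectories that pass verification become training data.
    frontier, seen, dataset = deque([start_url]), {start_url}, []
    while frontier and len(seen) <= max_pages:
        page = frontier.popleft()
        for task in propose_tasks(page):
            traj = attempt(page, task)      # agent rollout from this page
            if verify(task, traj):          # keep only successful demos
                dataset.append((task, traj))
        for url in discover_links(page):    # expand the frontier
            if url not in seen:
                seen.add(url)
                frontier.append(url)
    return dataset
```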
👆 OpenAI recently rolled back its GPT-4o update due to sycophancy: being overly flattering and agreeable. 🧐 However, can we measure sycophancy in these real-world failure cases? 🤗 Introducing SYCON-Bench, a benchmark that quantifies sycophancy in multi-turn dialogues! 📑 1/5
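A toy version of the measurement, not the benchmark's exact protocol: count how many turns of user pushback it takes before a model abandons a correct answer (`model` is a hypothetical chat function over a message list):

```python
def turns_to_flip(model, question, correct_answer, pushback, max_turns=5):
    # Repeat the same pushback each turn and record when the model's
    # stance flips; later flips (or no flip) indicate less sycophancy.
    history = [{"role": "user", "content": question}]
    answer = model(history)
    history.append({"role": "assistant", "content": answer})
    for turn in range(1, max_turns + 1):
        history.append({"role": "user", "content": pushback})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        if correct_answer not in reply:   # crude flip check for the sketch
            return turn
    return max_turns + 1                  # never flipped under pressure
```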
New paper by Andre He: Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening arxiv.org/abs/2506.02355 Tired of sharpening the distribution? Try unlikeliness reward to learn new things from the roads less traveled
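One way the idea could be sketched: among correct samples, add a bonus that grows as the policy's own log-probability shrinks, then normalize group-relative as in GRPO. Illustrative shape only, not the paper's exact formulation:

```python
import torch

def unlikeliness_shaped_advantages(correct: torch.Tensor,
                                   logprobs: torch.Tensor,
                                   beta: float = 0.5) -> torch.Tensor:
    # correct: (n,) 0/1 verifier outcomes; logprobs: (n,) sequence log-probs
    # under the current policy for a group of sampled solutions.
    base = correct.float()
    rank = torch.argsort(torch.argsort(logprobs))       # 0 = least likely
    unlikeliness = 1.0 - rank.float() / max(len(logprobs) - 1, 1)
    # Bonus applies only to correct samples, so the policy is pushed toward
    # new correct solutions instead of just sharpening its favorites.
    reward = base * (1.0 + beta * unlikeliness)
    # Group-relative normalization as in GRPO.
    return (reward - reward.mean()) / (reward.std() + 1e-6)
```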
Using LLMs to build AI scientists is all the rage now (e.g., Google’s AI co-scientist [1] and Sakana’s Fully Automated Scientist [2]), but how much do we understand about their core scientific abilities? We know how LLMs can be vastly useful (solving complex math problems) yet…
🚨 @frimelle and I are looking for a junior collaborator to research the Open Model Ecosystem! 🤖 Ideally, someone w/ AI/ML background, who can help w/ annotation pipeline + analysis. docs.google.com/forms/d/e/1FAI…
Thrilled to announce that I will be joining @UTAustin @UTCompSci as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
🚨 New Paper co-led with @bkjeon1211 🚨 Q. Can we adapt language models, trained to predict the next token, to reason at the sentence level? I think LMs operating at a higher level of abstraction would be a promising path toward advancing their reasoning, and I am excited to share our…
Within the RAG pipeline, the retriever often acts as the bottleneck! Instead of training a better embedding model, we explore using a reasoning model as both the retriever and the generator. To do this, we add MCTS to the generative retrieval pipeline. Check out @chaechaek1214's post!
❓What if your RAG didn’t need a separate retrieval model at all? We present 🧊FREESON, a new framework for retriever-FREE retrieval-augmented reasoning. With FREESON, a single LRM acts as both generator and retriever, shifting the focus from seq2seq matching to locating…
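In miniature, the retriever-free idea might look like ranking passages by the reasoning model's own likelihoods instead of embedding similarity; `score` is a hypothetical log-prob function, and the actual framework goes further (e.g., MCTS over generative retrieval):

```python
def lrm_retrieve(score, query, corpus, k=3):
    # Reuse the reasoning model itself to locate knowledge: rank each
    # passage by the LRM's log-probability of it following the query.
    prompt = f"Question: {query}\nRelevant passage:"
    ranked = sorted(corpus, key=lambda p: score(prompt, p), reverse=True)
    return ranked[:k]  # the same LRM then answers from these passages
```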
We introduce Web Shepherd, the first PRM specialized for web navigation🌎 Prior works have used LLM-as-a-Judge to assess trajectories (for RL) or each step (for test-time algorithms). Yet this is not suitable for real-world scenarios since it takes too much time! Web Shepherd not only…
🚀 Introducing Web-Shepherd: the first Process Reward Model (PRM) that guides web agents. 🌐 Current web browsing agents look cool, but they're not fully reliable! 😬They excel at simple tasks but struggle with complex ones. ❓ Can inference-time scaling help? Previous methods…
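A minimal sketch of PRM-guided inference-time scaling, assuming hypothetical `policy` and `prm` callables: sample several candidate actions and execute the one the process reward model scores highest against the task checklist:

```python
def guided_step(policy, prm, observation, checklist, n=8):
    # Best-of-n action selection: the PRM scores each candidate step,
    # steering the agent without any extra policy training.
    candidates = [policy(observation) for _ in range(n)]
    scores = [prm(observation, checklist, a) for a in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```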
Web-Shepherd just dropped on Hugging Face: Advancing PRMs for Reinforcing Web Agents
Turns out that reasoning models not only excel at solving problems but are also excellent confidence estimators - an unexpected side effect of long CoTs! This reminds me that smart ppl are good at determining what they know & don't know👀 Check out @dongkeun_yoon 's post!
🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”? ❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
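A quick way to sanity-check such verbalized confidences is expected calibration error: bin the stated probabilities and compare each bin's average confidence to its empirical accuracy. This is the standard metric, sketched from scratch, not the paper's code:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    # confidences: stated probabilities in [0, 1] (e.g., 0.6 for "60% likely
    # to be correct"); corrects: 0/1 outcomes. Lower ECE = better calibrated.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, corrects):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# e.g. expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1])
```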