Yuda Song
@yus167
PhD @mldcmu. Previously @ucsd_cse @UcsdMathDept
like everyone else i am hopping on the blog post trend gene.ttic.edu/blog/incomplet…
Discussing "Mind the Gap" tonight at @haizelabs's NYC AI Reading Group with @leonardtang_ and @willccbb. Authors study self-improvement through the "Generation-Verification Gap" (model's verification ability over its own generations) and find that this capability log scales with…
Still noodling on this, but the generation-verification gap proposed by @yus167 @_hanlin_zhang_ @ShamKakade6 @udayaghai et al. in arxiv.org/abs/2412.02674 is a very nice framework that unifies a lot of thoughts around self-improvement/verification/bootstrapping reasoning
when/is verification harder than specification?
1/So much of privacy research is designing post-hoc methods to make models memorization-free. It’s time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training🧵
I’m presenting two papers on value-based RL for post-training & reasoning on Friday at @ai4mathworkshop at #ICML2025! 1️⃣ Q#: lays theoretical foundations for value-based RL for post-training LMs; 2️⃣ VGS: practical value-guided search scaled up for long CoT reasoning. 🧵👇
(1/4)🚨 Introducing Goedel-Prover V2 🚨 🔥🔥🔥 The strongest open-source theorem prover to date. 🥇 #1 on PutnamBench: Solves 64 problems—with far less compute. 🧠 New SOTA on MiniF2F: * 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%. * 8B > 671B: Our 8B…
Please attend @yidingjiang 's oral presentation of our work, Paprika, at ICML!
I will talk about how to train agents with decision making capabilities that generalize to completely new environments: x.com/FahimTajwar10/…
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
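For intuition only, here is a toy (non-differentiable) version of what a dynamic chunking module could look like: a learned scorer flags boundary positions, and everything between boundaries is pooled into one chunk embedding. This is my own illustration of the idea, not the H-Net architecture, which would need an end-to-end differentiable mechanism:

```python
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    """Toy sketch: score each byte-level position as a chunk boundary,
    then mean-pool the positions between boundaries into chunk embeddings
    that downstream layers operate on."""
    def __init__(self, dim):
        super().__init__()
        self.boundary_scorer = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (seq, dim) byte-level embeddings
        boundary = torch.sigmoid(self.boundary_scorer(x)).squeeze(-1) > 0.5
        chunks, current = [], []
        for t in range(x.shape[0]):
            current.append(x[t])
            if boundary[t]:                    # close the chunk at a predicted boundary
                chunks.append(torch.stack(current).mean(0))
                current = []
        if current:                            # flush the trailing partial chunk
            chunks.append(torch.stack(current).mean(0))
        return torch.stack(chunks)             # (num_chunks, dim) chunk embeddings
```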
Black-box Optimization for LLM Post-Training 💪 Strong non-vacuous generalization bounds ✔️ Privacy by design ✔️ Robustness to poisoning and data extraction ✔️ Improvement on reasoning benchmarks ✔️ @AIatMeta @NYUDataScience (1/8)
❓How to balance negative and positive rewards in off-policy RL❓ In Asymmetric REINFORCE for off-Policy RL, we show that giving less weight to negative rewards is enough to stabilize off-policy RL training for LLMs! 💪 (1/8) Paper: arxiv.org/abs/2506.20520
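The core trick, as I read the tweet, is just an asymmetric weight on the policy-gradient term. A minimal PyTorch sketch, assuming reward-minus-baseline advantages and a tunable down-weighting factor (the exact AsymRE objective in the paper may differ):

```python
import torch

def asymmetric_reinforce_loss(logprobs, advantages, neg_weight=0.5):
    """REINFORCE-style loss that down-weights negative-advantage samples.

    logprobs:   (batch,) summed log-probabilities of the sampled responses
    advantages: (batch,) reward minus baseline for each response
    neg_weight: weight in [0, 1] applied to negative advantages
                (neg_weight = 1 recovers vanilla REINFORCE).
    """
    weights = torch.where(advantages >= 0,
                          torch.ones_like(advantages),
                          torch.full_like(advantages, neg_weight))
    # Maximize the weighted advantage-scaled log-likelihood -> minimize its negative.
    return -(weights * advantages.detach() * logprobs).mean()
```

With neg_weight = 1 this is plain REINFORCE; the claim in the tweet is that choosing neg_weight < 1 is enough to stabilize off-policy training.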
Current RLVR methods like GRPO and PPO require explicit critics or multiple generations per prompt, resulting in high computational and memory costs. We introduce ⭐A*-PO, a policy optimization algorithm that uses only a single sample per prompt during online RL, without a critic.
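To make the "single sample per prompt, no critic" part concrete: one plausible shape of such an update is to precompute a fixed per-prompt baseline offline (e.g., from reference-policy samples), so the online step needs neither a value network nor a group of generations. This is only an illustration of the setup, not the actual A*-PO objective:

```python
import torch

def single_sample_pg_loss(logprob, reward, offline_baseline):
    """Single-sample policy-gradient update with a precomputed per-prompt baseline.

    logprob:          (batch,) log-probability of the one sampled response per prompt
    reward:           (batch,) verifiable reward for that response
    offline_baseline: (batch,) per-prompt value estimate computed once offline
                      (standing in for a learned critic or a group of generations)
    """
    advantage = reward - offline_baseline
    return -(advantage.detach() * logprob).mean()
```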
In this paper we explore how we can efficiently scale inference-time compute for agents. Instead of blindly scaling the number of tokens at each step, it would be much better to scale the number of interactions! Check out how we did it!
🧵 1/7 Should AI agents "think more" or "do more"? 🤔 The current trend is to scale test-time compute, making agents generate longer reasoning traces. But what if that’s the wrong approach for interactive tasks? In our new work, we argue for a new scaling dimension: Test-Time…
Say ahoy to 𝚂𝙰𝙸𝙻𝙾𝚁⛵: a new paradigm of *learning to search* from demonstrations, enabling test-time reasoning about how to recover from mistakes w/o any additional human feedback! 𝚂𝙰𝙸𝙻𝙾𝚁 ⛵ outperforms Diffusion Policies trained via behavioral cloning on 5-10x the data!
SCA is the first self-improvement RL framework for general multi-turn tool-use agents. It does so by generating its own synthetic tasks along with its own verifiers for them. Stay tuned for more details!
🚨Self-Challenging Language Model Agents🚨 📝: arxiv.org/abs/2506.01716 A new paradigm to train LLM agents to use different tools with challenging self-generated data ONLY: Self-challenging agents (SCA) both propose new tasks and solve them, using self-generated verifiers to…
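A rough picture of that loop, with every name (propose_task, solve, train_step) a hypothetical placeholder rather than the paper's API:

```python
def self_challenging_round(agent, tools, train_step):
    """One hypothetical round of a self-challenging loop: the agent proposes a
    tool-use task together with a programmatic verifier, then attempts the task
    and is rewarded by its own verifier."""
    task, verifier = agent.propose_task(tools)   # self-generated task + verifier
    trajectory = agent.solve(task, tools)        # multi-turn tool-use attempt
    reward = verifier(trajectory)                # self-generated verifier scores it
    train_step(agent, trajectory, reward)        # RL update on the self-labeled episode
```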
Can Large Reasoning Models Self-Train? We propose Self-Rewarded Training (SRT)—where LLMs generate their own supervision. Main findings: SRT initially matches RL on ground truth, but sustained training risks reward hacking. We also investigate mitigation strategies.
One fundamental issue with RL – whether it’s for robots or LLMs – is how hard it is to get rewards. For LLM reasoning, we need ground-truth labels to verify answers. We found that maximizing confidence alone allows LLMs to improve their reasoning with RL!
Excited to share our work: Maximizing Confidence Alone Improves Reasoning. Humans rely on confidence to learn when answer keys aren’t available (e.g., taking an exam). Surprisingly, LLMs can also learn w/o ground-truth answers, simply by reinforcing high-confidence answers via RL!
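One natural way to turn "confidence" into an RL reward, sketched below, is the negative mean token entropy of the model's own answer. I'm using this as an assumed instantiation of the idea, not necessarily the paper's exact reward:

```python
import torch
import torch.nn.functional as F

def confidence_reward(logits, generated_ids, pad_id=0):
    """Reward a sampled response by the model's own confidence (no labels needed).

    logits:        (batch, seq, vocab) logits at each generated position
    generated_ids: (batch, seq) sampled token ids (used only to mask padding)
    Returns a (batch,) reward: higher confidence (lower entropy) -> higher reward.
    """
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)                  # (batch, seq) per-token entropy
    mask = (generated_ids != pad_id).float()
    mean_entropy = (entropy * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -mean_entropy
```

This reward can then be dropped into any standard RL-for-LLM pipeline in place of a verifiable ground-truth reward.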
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n
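A minimal sketch of what "providing their own reward" could look like, assuming a majority-voting (self-consistency) pseudo-label over the model's own samples; the paper's exact scheme may differ:

```python
from collections import Counter

def self_rewards(answers):
    """Self-rewarded pseudo-labels via majority voting: treat the most common
    final answer among the model's own samples for a prompt as the pseudo-label
    and reward agreement with it."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# e.g. self_rewards(["42", "42", "41", "42"]) -> [1.0, 1.0, 0.0, 1.0]
```

The failure mode flagged above (reward hacking under sustained training) shows up when the policy collapses onto confidently wrong majority answers.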
new preprint with the amazing @LucaViano4 and @neu_rips on offline imitation learning! when the expert is hard to represent but the environment is simple, estimating a Q-value rather than the expert directly may be beneficial. there are many open questions left though!
With previous research in multimodal models and agents, I believe the only truly useful multimodal agent before 2027 is multimodal co-creation in structured formats. Sharing my first blogpost, cuz I do not quite see this point of view around but it can be quite impactful to society.
Is Best-of-N really the best we can do for language model inference? New algo & paper: 🚨InferenceTimePessimism🚨 Led by the amazing Audrey Huang (@auddery) with Adam Block, Qinghua Liu, Nan Jiang (@nanjiang_cs), and Akshay Krishnamurthy. Appearing at ICML '25. 1/11