Yuanhe Zhang
@yuanhezhang6
Pragmatic Learning Theory, using tools from probability and statistics | PhD in Stats @warwickstats 🇬🇧 | MMathStat @warwickstats 🇬🇧
(1/n) 🚀Thrilled to share our LoRA-One work (arxiv.org/abs/2502.01235) as an #ICML25 𝐨𝐫𝐚𝐥 𝐩𝐫𝐞𝐬𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧, w. Fanghui @Fanghui_SgrA (Warwick) and Yudong (Madison). Oral @ West Ballroom B, 4pm on July 17th. Poster @ West Exhibition Hall B2-B3 #W 905, 4:30pm on July 15th.
Why does RL struggle with long reasoning chains? Because finding the correct solution by chance is exponentially rare. Solution: break down the complexity of the problem somehow, and ease into it adaptively! We propose AdaBack: an adaptive backtracking method that conditions…
Why does RL struggle with tasks requiring long reasoning chains? Because “bumping into” a correct solution becomes exponentially less likely as the number of reasoning steps grows. We propose an adaptive backtracking algorithm: AdaBack. 1/n
This weekend I read papers on context engineering. My personal take: the main methods are very close to RAG's 4R: Retriever, Rewriter, Reranker, Reader. The CE methods around memory and tool-call responses are more or less covered by the 4R. Among the 4R, my favorite work is the work around the Rewriter, i.e., the query the system actually processes is not necessarily the user's original query.…
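A minimal, hypothetical sketch of the 4R shape described above (all function names and the scoring logic are placeholders of mine, not any specific library or the papers' method): the Reader ends up consuming a query that the Rewriter may have changed from the user's original.

```python
# Hypothetical 4R pipeline sketch: Rewriter -> Retriever -> Reranker -> Reader.
def rewrite(user_query: str, history: list[str]) -> str:
    # e.g. fold in chat history / resolve references; the processed query
    # need not be the user's original query.
    return user_query if not history else f"{history[-1]} ; {user_query}"

def retrieve(query: str, corpus: list[str], k: int = 10) -> list[str]:
    # Placeholder lexical retrieval: keep documents sharing any query token.
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())][:k]

def rerank(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Placeholder scoring by token overlap; a real system would use a cross-encoder.
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))[:k]

def read(query: str, docs: list[str]) -> str:
    # Placeholder "Reader": in practice, an LLM conditioned on query + passages.
    return f"Answer to {query!r} using {len(docs)} passages"

def answer(user_query: str, history: list[str], corpus: list[str]) -> str:
    q = rewrite(user_query, history)
    return read(q, rerank(q, retrieve(q, corpus)))
```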
A beautiful visual blog where you can change values, interact, and see exactly what each head does inside the transformer.
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers for the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
「 Data Contamination, Qwen2.5 」 The data contamination problem in the Qwen2.5 series has been confirmed: the models had already seen the evaluation questions during pretraining. Over the past few months, several LLM Reasoning + RL papers found that extremely weak or even random rewards could significantly boost the math reasoning ability of the Qwen series. This raised the suspicion that Qwen models had already seen the evaluation questions in the pretraining stage.…
Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? ift.tt/UnVtw01
Gradient Descent Algorithm in Hilbert Spaces under Stationary Markov Chains with $\phi$- and $\beta$-Mixing ift.tt/0mzSlCO
I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy…
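A quick numerical check of that claim (my own sketch, not code from any paper): with group rewards r_1..r_N, rescaling the group-mean advantage r_i - mean(r) by N/(N-1) recovers the leave-one-out advantage r_i - mean(r_{-i}).

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(4)          # rewards for a small group, N = 4
N = len(r)

# Group-mean advantage: subtract the mean that includes r_i itself
a_group_mean = r - r.mean()

# Leave-one-out advantage: subtract the mean of the *other* N-1 rewards
a_loo = r - (r.sum() - r) / (N - 1)

# The N/(N-1) correction turns one into the other
assert np.allclose(a_group_mean * N / (N - 1), a_loo)
print(a_group_mean * N / (N - 1))
print(a_loo)
```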
🪂Understanding R1-Zero-Like Training: A Critical Perspective
* DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning??
* The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO??
* Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full…
It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public! 🧵
There are several hypotheses for why Adam outperforms SGD on LLMs: heavy-tailed noise, blowing up curvature, near-constant magnitude of update, etc. The one I find most compelling is label imbalance: Adam specifically improves performance on rare classes, of which there are many.
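A toy illustration of that last hypothesis (my own sketch, not from the thread): Adam's per-step update stays roughly learning-rate-sized no matter how small the gradient is, so parameters tied to rare classes, whose gradients are tiny, still move at a similar rate, while SGD's update shrinks in proportion to the gradient.

```python
import numpy as np

def sgd_step(g, lr=1e-3):
    # Plain SGD: update size scales directly with the gradient magnitude.
    return lr * g

def adam_step(g_history, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Run plain Adam on a scalar parameter and return the size of the last update.
    m = v = 0.0
    update = 0.0
    for t, g in enumerate(g_history, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        update = lr * m_hat / (np.sqrt(v_hat) + eps)
    return update

# "Frequent class": large, consistent gradients.  "Rare class": tiny gradients.
frequent = [1.0] * 100
rare = [1e-4] * 100

print("SGD  step, frequent vs rare:", sgd_step(frequent[-1]), sgd_step(rare[-1]))
print("Adam step, frequent vs rare:", adam_step(frequent), adam_step(rare))
# SGD's rare-class update is 10^4 times smaller; Adam's two updates are both ~lr.
```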
How can we quantify uncertainty in LLMs from only a few sampled outputs? The key lies in the classical problem of missing mass—the probability of unseen outputs. This perspective offers a principled foundation for conformal prediction in query-only settings like LLMs.
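One concrete handle on the missing-mass quantity mentioned above (a sketch under my reading of the tweet, not necessarily the authors' estimator) is the classical Good-Turing estimate: the fraction of samples whose output was observed exactly once.

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the probability of outputs never observed:
    (# distinct outputs seen exactly once) / (total number of samples)."""
    counts = Counter(samples)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(samples)

# e.g. a handful of sampled LLM answers to the same query
answers = ["42", "42", "forty-two", "41", "42"]
print(good_turing_missing_mass(answers))  # 2/5 = 0.4: "forty-two" and "41" are singletons
```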