Tianyu Pang
@TianyuPang1
🇸🇬Research Scientist at Sea AI Lab @SeaGroup; 👨🏻🎓PhD/BS from @Tsinghua_Uni and ex-@MSFTResearch; 🛡️Trustworthy AI and Generative Models.
🤔Can 𝐂𝐡𝐚𝐭𝐛𝐨𝐭 𝐀𝐫𝐞𝐧𝐚 be manipulated? 🚀Our paper shows that You Can Improve Model Rankings on Chatbot Arena by Vote Rigging! ⚒️Just a few hundred rigged votes are sufficient! 🫤Still trust Chatbot Arena rankings? Read more📚arxiv.org/pdf/2501.17858
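Arena-style leaderboards fit a rating from pairwise votes (an Elo / Bradley-Terry style model), which is what makes targeted voting effective. The toy Elo simulation below is only meant to build intuition for why a few hundred one-sided votes can move a model's rank; the K-factor and starting ratings are illustrative assumptions, not numbers from the paper.

```python
# Toy sketch (not the paper's rigging strategy): feed a few hundred votes that all
# favor one model and watch its Elo rating climb past a rival's.

def elo_update(r_winner: float, r_loser: float, k: float = 4.0):
    """Standard Elo update for a single pairwise vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

target, rival = 1200.0, 1230.0   # hypothetical starting ratings
for _ in range(300):             # a few hundred votes rigged in the target's favor
    target, rival = elo_update(target, rival)

print(f"after 300 rigged votes: target={target:.0f}, rival={rival:.0f}")
```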

Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers: 1.🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a. Dr. GRPO, but I think the paper is more than this loss fix) 2. Main…
Why you should stop working on RL research and instead work on product // The technology that unlocked the big scaling shift in AI is the internet, not transformers I think it's well known that data is the most important thing in AI, and also that researchers choose not to work…
I’d say maybe the way to go is to RL-train with truncation in mind (for denser rewards), like arxiv.org/abs/2505.13438. Then at inference your model should naturally handle truncation (even down to 0 for no thinking).
Padding in our non-AR sequence models? Yuck. 🙅 👉 Instead of unmasking, our new work *Edit Flows* perform iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
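The appeal of insert/delete operations is that the sequence can grow and shrink freely, so variable-length outputs need no padding or mask tokens. The toy sketch below only illustrates what position-relative edits look like on a token list; it is not the Edit Flows model or its training objective.

```python
# Toy sketch of position-relative edit operations (insert/delete) on a token list.
from typing import List, Tuple

def apply_edits(tokens: List[str], edits: List[Tuple[str, int, str]]) -> List[str]:
    """Apply ("ins", pos, tok) and ("del", pos, "") edits to the current sequence."""
    out = list(tokens)
    # Apply right-to-left so earlier positions are not shifted by later edits.
    for op, pos, tok in sorted(edits, key=lambda e: e[1], reverse=True):
        if op == "ins":
            out.insert(pos, tok)   # sequence grows, no padding needed
        elif op == "del":
            del out[pos]           # sequence shrinks
    return out

seq = ["the", "cat", "sat"]
seq = apply_edits(seq, [("ins", 3, "down"), ("del", 0, ""), ("ins", 0, "a")])
print(seq)  # ['a', 'cat', 'sat', 'down']
```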
Customizing your LLMs in seconds using prompts🥳! Excited to share our latest work with @HPCAILab, @VITAGroupUT, @k_schuerholt, @YangYou1991, @mmbronstein, @damianborth: Drag-and-Drop LLMs (DnD). 2 features: tuning-free, and comparable to or even better than full-shot tuning. (🧵1/8)
Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and…
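For readers new to tool-agent RL, the rollout structure is the key ingredient: the policy alternates between generating text and emitting tool calls whose results are appended to the context, and the finished trajectory is then scored and passed to an RL algorithm such as GRPO or PPO. The sketch below is a generic illustration of that loop; the names and objects are placeholders, not VerlTool's actual API.

```python
# High-level sketch of a tool-agent rollout (placeholder interfaces, not VerlTool's API).
def rollout(policy, tools, question, max_turns=4):
    context = question
    for _ in range(max_turns):
        step = policy.generate(context)        # may contain e.g. <tool>search(...)</tool>
        context += step.text
        if step.tool_call is None:             # model produced a final answer
            break
        result = tools[step.tool_call.name](step.tool_call.args)
        context += f"\n<result>{result}</result>\n"
    return context                             # trajectory to be rewarded and trained on
```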
We do appreciate their effort in writing the criticism, but “turns out that the results in this paper are misreported” is a strong claim to make without running the evaluations themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Incorrect-Base…),…
So it turns out that the results in this paper are misreported. The Qwen3 report shows the 4B thinking model at 55.9 on GPQA, much higher than both the 32 reported here and the post-RL number of 45. Similarly for other datasets. No wonder our RL claims keep falling apart. Noticed by @nikhilchandak29
🧠 How can we foster Temporal Reasoning in Videos? 📽️ Inspired by LLMs’ next-token prediction, we propose Next-Event Prediction (NEP), a self-supervised task that teaches MLLMs to reason about the future in videos. ⚡️Just like Doctor Strange, our models see what's next!
🎦 Fostering Video Reasoning through Next-Event Prediction 🎞️Introducing Next-Event Prediction (NEP), a self-supervised task enabling MLLMs to reason temporally by predicting future events from past video frames, like foreseeing scenes before they unfold. 🌟 Why Next-Event…
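The self-supervised recipe is easy to picture: split a video at a cutoff, condition on the past frames, and supervise on a textual description of what happens next. The sketch below shows one plausible way to build such a training pair; the function and field names are illustrative assumptions, not the paper's data pipeline.

```python
# Hedged sketch of constructing a Next-Event-Prediction style training example.
def make_nep_example(frames, event_annotations, cutoff_idx):
    """Split a video at `cutoff_idx`: past frames are the input, the next event is the target."""
    past_frames = frames[:cutoff_idx]
    future_events = [e for e in event_annotations if e["start_frame"] >= cutoff_idx]
    target_text = future_events[0]["description"] if future_events else "nothing notable happens"
    prompt = "Given the video so far, predict what happens next."
    return {"frames": past_frames, "prompt": prompt, "target": target_text}
```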
Reinforcing General Reasoning without (External) Verifiers 🧐 Your model's likelihood on reference answers is secretly a reliable reward signal for general reasoning! More details 👇
Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers — and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip…
Reinforcing General Reasoning without Verifiers "we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and…
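At a high level, the verifier-free reward can be computed with the policy itself: score a sampled reasoning trace by how likely the reference answer is under the model, conditioned on the question and that trace. The sketch below shows this scoring step only (assuming a Hugging Face causal LM and tokenizer); the exact objective and normalization used in VeriFree may differ.

```python
# Minimal sketch: log-likelihood of the reference answer as a verifier-free reward.
import torch

def reference_answer_logprob(model, tokenizer, question, reasoning, ref_answer):
    """Return sum of log p(ref_answer tokens | question + reasoning)."""
    prefix_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(ref_answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of each answer token, predicted from the preceding position.
    ans_logits = logits[:, prefix_ids.shape[1] - 1 : -1, :]
    logprobs = torch.log_softmax(ans_logits, dim=-1)
    token_lp = logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()   # usable as a reward for RL, no external verifier needed
```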
🚨Safety-aligned LLMs are getting jailbroken by new unseen attacks. 💡We propose Lifelong Safety Alignment: - Meta-Attacker develops unseen jailbreaks via reasoning; - Defender learns to resist them. 📚Iterative adversarial-play evolution leads to both strong Meta-Attacker and…
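The iterative adversarial-play loop described above can be summarized in a few lines. The sketch below is purely illustrative: `propose_jailbreaks`, `respond`, `is_harmful`, and the two `update` steps are placeholder interfaces, not the paper's implementation.

```python
# High-level sketch of the Meta-Attacker vs. Defender adversarial-play loop (placeholders).
def lifelong_safety_alignment(meta_attacker, defender, judge, rounds=5, n_attacks=64):
    for _ in range(rounds):
        # 1) Meta-Attacker reasons about past outcomes and proposes new jailbreaks.
        attacks = meta_attacker.propose_jailbreaks(n=n_attacks)
        # 2) Defender answers; a judge labels which attacks succeeded.
        outcomes = [(a, judge.is_harmful(defender.respond(a))) for a in attacks]
        successful = [a for a, harmful in outcomes if harmful]
        # 3) Both sides update: attacker reinforces winning strategies,
        #    defender trains to resist them.
        meta_attacker.update(successful)
        defender.update(successful)
    return meta_attacker, defender
```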

Introducing QuickVideo, 🚀 speeding up the end-to-end time from the mp4 bitstream to VideoLLM inference by at least 2.5 times for hour-long video understanding (e.g., 1024 frames) on a single 40GB GPU. 🤔What are the key challenges of hour-long video understanding? 1.…
🚀 Excited to share our latest work: "Scaling Diffusion Transformers Efficiently via μP"! Diffusion Transformers are essential in visual generative models, but hyperparameter tuning for scaling remains challenging. We adapt μP, proving it also applies to diffusion Transformers!
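The practical payoff of μP is zero-shot hyperparameter transfer: tune on a small "base" width, then rescale per-parameter learning rates when widening the model. The sketch below shows only the 1/width Adam rule for hidden (matrix-like) weights as a simplified illustration; the full μP recipe also adjusts initializations and output multipliers, and the widths/learning rates here are made-up examples.

```python
# Hedged sketch of μP-style learning-rate transfer across widths (simplified).
def mup_adam_lr(base_lr: float, base_width: int, width: int, param_kind: str) -> float:
    """Width-aware Adam learning rate (only the hidden-matrix rule is shown)."""
    if param_kind == "hidden_matrix":         # weights whose fan-in grows with width
        return base_lr * base_width / width   # LR shrinks as the model widens
    return base_lr                            # e.g. biases / vector-like params (simplified)

# Tune at width 256, then transfer the same base_lr to a width-2048 DiT.
print(mup_adam_lr(3e-4, base_width=256, width=2048, param_kind="hidden_matrix"))  # 3.75e-05
```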
I believe our work offers a promising technique for building reasoners with fine-grained budget control, like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!
Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇
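To make the "dense rewards" idea concrete: truncate one sampled trace at several thinking budgets, force an answer at each cutoff, and score each answer with a verifiable reward, so the trace earns credit for being correct early. The sketch below shows only this evaluation step; BRPO's budget-relative baseline and credit-assignment details are in the paper and not reproduced here, and the helper callables are assumptions.

```python
# Hedged sketch of per-budget (dense) rewards for anytime reasoning.
def anytime_rewards(trace_tokens, budgets, force_answer, verify):
    """Return {budget: reward} for one reasoning trace.

    force_answer(prefix_tokens) -> answer produced after truncating the trace.
    verify(answer) -> 1.0 if the final answer is correct else 0.0.
    """
    rewards = {}
    for b in budgets:
        prefix = trace_tokens[:b]        # best-effort reasoning under a budget of b tokens
        answer = force_answer(prefix)    # e.g. append a closing tag and "The answer is"
        rewards[b] = verify(answer)
    return rewards

# Averaging over budgets gives a single scalar that rewards early correctness:
# anytime_score = sum(anytime_rewards(t, [512, 1024, 2048, 4096], fa, vf).values()) / 4
```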
I tried TRL again... I'm going back to OAT. Every time I try to use TRL, it's always a nightmare. OAT is plug and play. github.com/sail-sg/oat
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
🪇We propose Budget Relative Policy Optimization (BRPO) to facilitate Anytime Reasoning! For more details👇
👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/Anytim…
Very interesting analysis! However, we found that you can actually achieve the same performance on the same ONE example with a variant of SFT -> CFT (critique fine-tuning) arxiv.org/abs/2501.17703. It's much much much faster than RL on ONE example! Here is a teaser for our…
"RL with only one training example" and "Test-Time RL" are two recent papers that I found fascinating. In the "One Training example" paper the authors find one question and ask the model to solve it again and again. Every time, the model tries 8 times (the Group in GRPO), and…
Great to see Dr. GRPO is much more sample efficient than the original GRPO
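Both methods share the group-relative advantage: sample a group of rollouts per question (e.g. the 8 tries mentioned above), then compare each rollout's verifiable reward to the group mean. Dr. GRPO's loss fix removes the group-std division shown below (and the per-response length normalization in the token loss). This is an illustrative sketch, not the exact training code.

```python
# Sketch of group-relative advantages: original GRPO vs. the Dr. GRPO variant.
import numpy as np

def group_advantages(rewards, dr_grpo: bool = False, eps: float = 1e-6):
    """rewards: verifiable rewards for one group of rollouts (e.g. 8 samples per question)."""
    r = np.asarray(rewards, dtype=np.float32)
    centered = r - r.mean()                  # compare each rollout to its group mean
    if dr_grpo:
        return centered                      # Dr. GRPO: no std normalization
    return centered / (r.std() + eps)        # original GRPO: divide by group std

print(group_advantages([1, 1, 0, 0, 0, 0, 0, 1]))                 # GRPO
print(group_advantages([1, 1, 0, 0, 0, 0, 0, 1], dr_grpo=True))   # Dr. GRPO
```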
Tina: Tiny Reasoning Models via LoRA "the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness…
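The cost savings come from training only low-rank adapter matrices on a small base model. A minimal PEFT-style setup of that kind is sketched below; the rank, alpha, target modules, and base checkpoint are illustrative assumptions, not Tina's exact configuration.

```python
# Hedged sketch of LoRA-based post-training on a small reasoning model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # illustrative base model
lora_cfg = LoraConfig(
    r=16,                      # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the tiny LoRA matrices are trained
```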
FlowReasoner: Reinforcing Query-Level Meta-Agents - distilled from DeepSeek-R1 - RL with execution feedback (perf/complexity/efficiency) - builds agent teams per query - +10.52% accuracy over o1-mini across 3 code benchmarks - outperforms on eng + competition tasks - replaces…