Tianyu Pang
@TianyuPang1
🇸🇬Research Scientist at Sea AI Lab @SeaGroup; 👨🏻🎓PhD/BS from @Tsinghua_Uni and ex-@MSFTResearch; 🛡️Trustworthy AI and Generative Models.
🤔Can 𝐂𝐡𝐚𝐭𝐛𝐨𝐭 𝐀𝐫𝐞𝐧𝐚 be manipulated? 🚀Our paper shows that You Can Improve Model Rankings on Chatbot Arena by Vote Rigging! ⚒️Just a few hundred rigged votes are sufficient! 🫤Still trust Chatbot Arena rankings? Read more📚arxiv.org/pdf/2501.17858
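Arena-style leaderboards fit a rating from pairwise votes (an Elo / Bradley-Terry style model), which is what makes targeted voting effective. The toy Elo simulation below is only meant to build intuition for why a few hundred one-sided votes can move a model's rank; the K-factor and starting ratings are illustrative assumptions, not numbers from the paper.

```python
# Toy sketch (not the paper's rigging strategy): feed a few hundred votes that all
# favor one model and watch its Elo rating climb past a rival's.

def elo_update(r_winner: float, r_loser: float, k: float = 4.0):
    """Standard Elo update for a single pairwise vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

target, rival = 1200.0, 1230.0   # hypothetical starting ratings
for _ in range(300):             # a few hundred votes rigged in the target's favor
    target, rival = elo_update(target, rival)

print(f"after 300 rigged votes: target={target:.0f}, rival={rival:.0f}")
```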

Though not attending #ICML2025 in person, I'm super excited to share 3 accepted papers: 1.🎊Best Paper Honorable Mention @ AI4MATH workshop: Understanding R1-Zero-Like Training: A Critical Perspective (a.k.a. Dr. GRPO, but I think the paper is more than this loss fix) 2. Main…
Why you should stop working on RL research and instead work on product // The technology that unlocked the big scaling shift in AI is the internet, not transformers I think it's well known that data is the most important thing in AI, and also that researchers choose not to work…
I’d say maybe the way to go is to RL-train with truncation in mind (for denser rewards), like arxiv.org/abs/2505.13438. Then at inference your model should naturally handle truncation (even down to 0 for no thinking).
Padding in our non-AR sequence models? Yuck. 🙅 👉 Instead of unmasking, our new work *Edit Flows* perform iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
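The appeal of insert/delete operations is that the sequence can grow and shrink freely, so variable-length outputs need no padding or mask tokens. The toy sketch below only illustrates what position-relative edits look like on a token list; it is not the Edit Flows model or its training objective.

```python
# Toy sketch of position-relative edit operations (insert/delete) on a token list.
from typing import List, Tuple

def apply_edits(tokens: List[str], edits: List[Tuple[str, int, str]]) -> List[str]:
    """Apply ("ins", pos, tok) and ("del", pos, "") edits to the current sequence."""
    out = list(tokens)
    # Apply right-to-left so earlier positions are not shifted by later edits.
    for op, pos, tok in sorted(edits, key=lambda e: e[1], reverse=True):
        if op == "ins":
            out.insert(pos, tok)   # sequence grows, no padding needed
        elif op == "del":
            del out[pos]           # sequence shrinks
    return out

seq = ["the", "cat", "sat"]
seq = apply_edits(seq, [("ins", 3, "down"), ("del", 0, ""), ("ins", 0, "a")])
print(seq)  # ['a', 'cat', 'sat', 'down']
```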
Customizing your LLMs in seconds using prompts🥳! Excited to share our latest work with @HPCAILab, @VITAGroupUT, @k_schuerholt, @YangYou1991, @mmbronstein, @damianborth: Drag-and-Drop LLMs (DnD). 2 features: tuning-free, and comparable to or even better than full-shot tuning. (🧵1/8)
Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and…
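For readers new to tool-agent RL, the rollout structure is the key ingredient: the policy alternates between generating text and emitting tool calls whose results are appended to the context, and the finished trajectory is then scored and passed to an RL algorithm such as GRPO or PPO. The sketch below is a generic illustration of that loop; the names and objects are placeholders, not VerlTool's actual API.

```python
# High-level sketch of a tool-agent rollout (placeholder interfaces, not VerlTool's API).
def rollout(policy, tools, question, max_turns=4):
    context = question
    for _ in range(max_turns):
        step = policy.generate(context)        # may contain e.g. <tool>search(...)</tool>
        context += step.text
        if step.tool_call is None:             # model produced a final answer
            break
        result = tools[step.tool_call.name](step.tool_call.args)
        context += f"\n<result>{result}</result>\n"
    return context                             # trajectory to be rewarded and trained on
```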
We do appreciate their effort in writing the criticism, but “turns out that the results in this paper are misreported” is a strong claim to make without running the evaluations themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Incorrect-Base…),…
So it turns out that the results in this paper are misreported. The Qwen3 report shows the 4B thinking model at 55.9 on GPQA, much higher than both the 32 reported here and the post-RL number of 45. Similarly for other datasets. No wonder our RL claims keep falling apart. Noticed by @nikhilchandak29
🧠 How can we foster Temporal Reasoning in Videos? 📽️ Inspired by LLMs’ next-token prediction, we propose Next-Event Prediction (NEP), a self-supervised task that teaches MLLMs to reason about the future in videos. ⚡️Just like Doctor Strange, our models see what's next!
🎦 Fostering Video Reasoning through Next-Event Prediction 🎞️Introducing Next-Event Prediction (NEP), a self-supervised task enabling MLLMs to reason temporally by predicting future events from past video frames, like foreseeing scenes before they unfold. 🌟 Why Next-Event…
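The self-supervised recipe is easy to picture: split a video at a cutoff, condition on the past frames, and supervise on a textual description of what happens next. The sketch below shows one plausible way to build such a training pair; the function and field names are illustrative assumptions, not the paper's data pipeline.

```python
# Hedged sketch of constructing a Next-Event-Prediction style training example.
def make_nep_example(frames, event_annotations, cutoff_idx):
    """Split a video at `cutoff_idx`: past frames are the input, the next event is the target."""
    past_frames = frames[:cutoff_idx]
    future_events = [e for e in event_annotations if e["start_frame"] >= cutoff_idx]
    target_text = future_events[0]["description"] if future_events else "nothing notable happens"
    prompt = "Given the video so far, predict what happens next."
    return {"frames": past_frames, "prompt": prompt, "target": target_text}
```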
Reinforcing General Reasoning without (External) Verifiers 🧐 Your model's likelihood on reference answers is secretly a reliable reward signal for general reasoning! More details 👇
Reinforcing General Reasoning without Verifiers 🈚️ R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers — and model-based verifiers add complexity. Introducing *VeriFree*: ⚡ Skip…
Reinforcing General Reasoning without Verifiers "we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and…
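At a high level, the verifier-free reward can be computed with the policy itself: score a sampled reasoning trace by how likely the reference answer is under the model, conditioned on the question and that trace. The sketch below shows this scoring step only (assuming a Hugging Face causal LM and tokenizer); the exact objective and normalization used in VeriFree may differ.

```python
# Minimal sketch: log-likelihood of the reference answer as a verifier-free reward.
import torch

def reference_answer_logprob(model, tokenizer, question, reasoning, ref_answer):
    """Return sum of log p(ref_answer tokens | question + reasoning)."""
    prefix_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    answer_ids = tokenizer(ref_answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of each answer token, predicted from the preceding position.
    ans_logits = logits[:, prefix_ids.shape[1] - 1 : -1, :]
    logprobs = torch.log_softmax(ans_logits, dim=-1)
    token_lp = logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()   # usable as a reward for RL, no external verifier needed
```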
🚨Safety-aligned LLMs are getting jailbroken by new unseen attacks. 💡We propose Lifelong Safety Alignment: - Meta-Attacker develops unseen jailbreaks via reasoning; - Defender learns to resist them. 📚Iterative adversarial-play evolution leads to both strong Meta-Attacker and…
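The iterative adversarial-play loop described above can be summarized in a few lines. The sketch below is purely illustrative: `propose_jailbreaks`, `respond`, `is_harmful`, and the two `update` steps are placeholder interfaces, not the paper's implementation.

```python
# High-level sketch of the Meta-Attacker vs. Defender adversarial-play loop (placeholders).
def lifelong_safety_alignment(meta_attacker, defender, judge, rounds=5, n_attacks=64):
    for _ in range(rounds):
        # 1) Meta-Attacker reasons about past outcomes and proposes new jailbreaks.
        attacks = meta_attacker.propose_jailbreaks(n=n_attacks)
        # 2) Defender answers; a judge labels which attacks succeeded.
        outcomes = [(a, judge.is_harmful(defender.respond(a))) for a in attacks]
        successful = [a for a, harmful in outcomes if harmful]
        # 3) Both sides update: attacker reinforces winning strategies,
        #    defender trains to resist them.
        meta_attacker.update(successful)
        defender.update(successful)
    return meta_attacker, defender
```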

Introducing QuickVideo, 🚀 speeding up the end-to-end time from the mp4 bitstream to VideoLLM inference by at least 2.5 times for hour-long video understanding (e.g., 1024 frames) on a single 40GB GPU. 🤔What are the key challenges of hour-long video understanding? 1.…
🚀 Excited to share our latest work: "Scaling Diffusion Transformers Efficiently via μP"! Diffusion Transformers are essential in visual generative models, but hyperparameter tuning for scaling remains challenging. We adapt μP, proving it also applies to diffusion Transformers!
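The practical payoff of μP is zero-shot hyperparameter transfer: tune on a small "base" width, then rescale per-parameter learning rates when widening the model. The sketch below shows only the 1/width Adam rule for hidden (matrix-like) weights as a simplified illustration; the full μP recipe also adjusts initializations and output multipliers, and the widths/learning rates here are made-up examples.

```python
# Hedged sketch of μP-style learning-rate transfer across widths (simplified).
def mup_adam_lr(base_lr: float, base_width: int, width: int, param_kind: str) -> float:
    """Width-aware Adam learning rate (only the hidden-matrix rule is shown)."""
    if param_kind == "hidden_matrix":         # weights whose fan-in grows with width
        return base_lr * base_width / width   # LR shrinks as the model widens
    return base_lr                            # e.g. biases / vector-like params (simplified)

# Tune at width 256, then transfer the same base_lr to a width-2048 DiT.
print(mup_adam_lr(3e-4, base_width=256, width=2048, param_kind="hidden_matrix"))  # 3.75e-05
```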
I believe our work offers a promising technique for building reasoners with fine-grained budget control, like what Gemini 2.5 Pro has just introduced. Truncated at any time, we deliver the best-effort solution!
Anytime reasoning, a topic dating back to the ’90s, seeks the best-effort solution given a computation bound. For large reasoning models, we optimize anytime reasoning with 1) dense rewards and 2) better credit assignment (BRPO). For more details👇
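To make the "dense rewards" idea concrete: truncate one sampled trace at several thinking budgets, force an answer at each cutoff, and score each answer with a verifiable reward, so the trace earns credit for being correct early. The sketch below shows only this evaluation step; BRPO's budget-relative baseline and credit-assignment details are in the paper and not reproduced here, and the helper callables are assumptions.

```python
# Hedged sketch of per-budget (dense) rewards for anytime reasoning.
def anytime_rewards(trace_tokens, budgets, force_answer, verify):
    """Return {budget: reward} for one reasoning trace.

    force_answer(prefix_tokens) -> answer produced after truncating the trace.
    verify(answer) -> 1.0 if the final answer is correct else 0.0.
    """
    rewards = {}
    for b in budgets:
        prefix = trace_tokens[:b]        # best-effort reasoning under a budget of b tokens
        answer = force_answer(prefix)    # e.g. append a closing tag and "The answer is"
        rewards[b] = verify(answer)
    return rewards

# Averaging over budgets gives a single scalar that rewards early correctness:
# anytime_score = sum(anytime_rewards(t, [512, 1024, 2048, 4096], fa, vf).values()) / 4
```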
I tried TRL again... I'm going back to OAT. Every time I try to use TRL, it's always a nightmare. OAT is plug and play. github.com/sail-sg/oat
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
🪇We propose Budget Relative Policy Optimization (BRPO) to facilitate Anytime Reasoning! For more details👇
👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/Anytim…
Very interesting analysis! However, we found that you can actually achieve the same performance on the same ONE example with a variant of SFT -> CFT (critique fine-tuning) arxiv.org/abs/2501.17703. It's much much much faster than RL on ONE example! Here is a teaser for our…
"RL with only one training example" and "Test-Time RL" are two recent papers that I found fascinating. In the "One Training example" paper the authors find one question and ask the model to solve it again and again. Every time, the model tries 8 times (the Group in GRPO), and…
Great to see Dr. GRPO is much more sample efficient than the original GRPO
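Both methods share the group-relative advantage: sample a group of rollouts per question (e.g. the 8 tries mentioned above), then compare each rollout's verifiable reward to the group mean. Dr. GRPO's loss fix removes the group-std division shown below (and the per-response length normalization in the token loss). This is an illustrative sketch, not the exact training code.

```python
# Sketch of group-relative advantages: original GRPO vs. the Dr. GRPO variant.
import numpy as np

def group_advantages(rewards, dr_grpo: bool = False, eps: float = 1e-6):
    """rewards: verifiable rewards for one group of rollouts (e.g. 8 samples per question)."""
    r = np.asarray(rewards, dtype=np.float32)
    centered = r - r.mean()                  # compare each rollout to its group mean
    if dr_grpo:
        return centered                      # Dr. GRPO: no std normalization
    return centered / (r.std() + eps)        # original GRPO: divide by group std

print(group_advantages([1, 1, 0, 0, 0, 0, 0, 1]))                 # GRPO
print(group_advantages([1, 1, 0, 0, 0, 0, 0, 1], dr_grpo=True))   # Dr. GRPO
```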
Tina: Tiny Reasoning Models via LoRA "the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness…
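The cost savings come from training only low-rank adapter matrices on a small base model. A minimal PEFT-style setup of that kind is sketched below; the rank, alpha, target modules, and base checkpoint are illustrative assumptions, not Tina's exact configuration.

```python
# Hedged sketch of LoRA-based post-training on a small reasoning model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # illustrative base model
lora_cfg = LoraConfig(
    r=16,                      # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the tiny LoRA matrices are trained
```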
FlowReasoner: Reinforcing Query-Level Meta-Agents - distilled from DeepSeek-R1 - RL with execution feedback (perf/complexity/efficiency) - builds agent teams per query - +10.52% accuracy over o1-mini across 3 code benchmarks - outperforms on eng + competition tasks - replaces…