Fahim Tajwar
@FahimTajwar10
PhD Student @mldcmu @SCSatCMU | BS/MS from @Stanford
RL with verifiable rewards has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground-truth answers? Introducing Self-Rewarding Training (SRT): language models provide their own reward for RL training! 🧵 1/n

Please check out Gaurav's insanely cool work on memorization if you are at ICML!
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It's time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture at #ICML2025 to isolate memorization during LLM training 🧵
Recent work has seemed somewhat magical: how can RL with *random* rewards make LLMs reason? We pull back the curtain on these claims and find that this unexpected behavior hinges on the inclusion of certain *heuristics* in the RL algorithm. Our blog post: tinyurl.com/heuristics-con…
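For readers who want something concrete, here is a generic illustration of the *kind* of algorithmic heuristic such analyses focus on: PPO/GRPO-style ratio clipping. The tweet does not say this is the specific heuristic the blog post isolates, so treat this as a hedged sketch, not their result.

```python
import torch

# Generic illustration of an RL "heuristic": PPO/GRPO-style ratio clipping.
# The asymmetric clipping of the policy ratio can bias updates toward
# behavior the model already prefers, even when the reward itself carries
# little information. This is an example of the category of heuristics the
# blog post discusses, not necessarily the one it identifies.

def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = (logp_new - logp_old).exp()                 # policy ratio per rollout
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()    # pessimistic (clipped) objective

# Toy example: three rollouts with arbitrary advantages.
logp_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.2, -0.7, -1.5])
adv = torch.tensor([0.3, -0.1, 0.2])
print(clipped_policy_loss(logp_new, logp_old, adv))
```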
On Monday, I'll be presenting a tutorial on jailbreaking LLMs + the security of AI agents with @HamedSHassani and @aminkarbasi at ICML. I'll be in Vancouver all week -- send me a DM if you'd like to chat about jailbreaking, AI agents, robots, distillation, or anything else!
@abitha___ will be presenting our work on training language models to predict further into the future beyond the next token and the benefits this objective brings. x.com/gm8xx8/status/…
Looking beyond the next token: TRELAWNEY inserts future tokens <T>...</T> during training to teach models to plan ahead, boosting reasoning, coherence, and control. Highlights:
- NO ARCHITECTURE CHANGES. JUST SMARTER DATA.
- works with standard decoding
- enables controllable…
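Since TRELAWNEY is described as pure data augmentation (no architecture changes), here is a minimal sketch of what inserting a delimited future span into a training sequence might look like. The <T>...</T> delimiters come from the tweet; the function name, offsets, and span choices are my own illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch of TRELAWNEY-style data augmentation: copy a future
# span of the sequence forward, wrapped in <T>...</T>, so the model is
# trained to "announce" where the text is heading before generating it.

def insert_future_tokens(tokens, insert_at, future_start, future_len,
                         t_open="<T>", t_close="</T>"):
    """Return a new token list with a delimited future span inserted."""
    future_span = tokens[future_start:future_start + future_len]
    return (
        tokens[:insert_at]
        + [t_open] + future_span + [t_close]   # planning hint
        + tokens[insert_at:]                   # original continuation
    )

# Example: teach the model to plan for "the butler did it" early on.
story = "the detective searched the mansion and found that the butler did it".split()
augmented = insert_future_tokens(story, insert_at=3, future_start=8, future_len=4)
print(" ".join(augmented))
# the detective searched <T> the butler did it </T> the mansion and found that the butler did it
```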
I will be at ICML next week. If you are interested in chatting about anything related to generalization, exploration, and algorithmic information theory + computation, please get in touch 😀 (DM or email)! My coauthors and I will be presenting 2 papers 👇:
Please attend @yidingjiang 's oral presentation of our work, Paprika, at ICML!
I will talk about how to train agents with decision making capabilities that generalize to completely new environments: x.com/FahimTajwar10/…
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
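The tweet only says chunking is dynamic and learned inside the model, so the following is a heavily hedged toy sketch of the general idea of data-dependent, non-tokenizer chunk boundaries; the cosine-dissimilarity scoring, threshold, and function name are my assumptions, not the H-Net mechanism.

```python
import torch
import torch.nn.functional as F

# Toy sketch of "dynamic chunking": score a boundary between adjacent
# hidden states (here via cosine dissimilarity, an assumption on my part)
# and cut a new chunk wherever the score crosses a threshold. The real
# H-Net learns its chunking end-to-end inside the model; this only
# illustrates boundaries that depend on the data rather than a tokenizer.

def dynamic_chunks(hidden, threshold=0.5):
    """hidden: [T, d] per-byte hidden states -> list of (start, end) chunks."""
    sim = F.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)    # [T-1]
    boundary_score = (1 - sim) / 2                                # map to [0, 1]
    cuts = (boundary_score > threshold).nonzero().flatten() + 1   # chunk start indices
    starts = [0] + cuts.tolist()
    ends = cuts.tolist() + [hidden.shape[0]]
    return list(zip(starts, ends))

hidden = torch.randn(16, 8)   # toy per-byte representations
print(dynamic_chunks(hidden))
```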
A mental model I find useful: all data acquisition (web scrapes, synthetic data, RL rollouts, etc.) is really an exploration problem 🔍. This perspective has some interesting implications for where AI is heading. Wrote down some thoughts: yidingjiang.github.io/blog/post/expl…
Decision-making with LLMs can be studied with RL! Can an agent solve a task with text feedback (OS terminal, compiler, a person) efficiently? How can we understand the difficulty? We propose a new notion of learning complexity to study learning with language feedback only. 🧵👇
Incredibly excited to share that Neural MP got accepted to IROS as an Oral presentation!! Huge congrats to the whole team (@Jiahui_Yang6709, @mendonca_rl, Youssef Khaky, @rsalakhu, @pathak2206), but especially @Jiahui_Yang6709 for making this happen after I graduated! This now…
Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: A generalist policy for solving motion planning tasks in the real world 🤖 1/N
Say ahoy to 𝚂𝙰𝙸𝙻𝙾𝚁⛵: a new paradigm of *learning to search* from demonstrations, enabling test-time reasoning about how to recover from mistakes w/o any additional human feedback! 𝚂𝙰𝙸𝙻𝙾𝚁 ⛵ outperforms Diffusion Policies trained via behavioral cloning on 5-10x the data!
In my experience, the details of RLHF matter a shocking amount. If you'd like to avoid solving a hard exploration problem, this RLHF tutorial might be of interest :)
blog.ml.cmu.edu/2025/06/01/rlh… In this in-depth coding tutorial, @GaoZhaolin and @g_k_swamy walk through the steps to train an LLM via RL from Human Feedback!
"Can Large Reasoning Models Self-Train?" A brilliant paper from CMU showing LLMs can improve at math reasoning WITHOUT human labels - just learning from their own consistency. Early results rival models trained on ground-truth answers.
This is really great work by Fahim and co, moving out of the regime where we have ground truth rewards is critical for the next level of RL scaling in LLMs
Check out our latest work on self-improving LLMs, where we try to see if LLMs can use their internal self-consistency as a reward signal to bootstrap themselves with RL. TL;DR: they can, to some extent, but then end up reward hacking the self-consistency objective. We try to see…
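A minimal sketch of the self-consistency reward idea described above, assuming majority voting over sampled answers stands in for a ground-truth verifier; the function name and the exact aggregation are my assumptions, not the paper's implementation.

```python
from collections import Counter

# Hypothetical sketch of a self-consistency reward: sample several answers
# per prompt, take the majority answer as a pseudo-label, and reward each
# rollout for agreeing with it. This is exactly the kind of signal that can
# be reward-hacked: the policy can collapse to confidently consistent but
# wrong answers.

def self_consistency_rewards(answers):
    """answers: final answers extracted from sampled rollouts for one prompt."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers], majority

rollout_answers = ["42", "42", "41", "42"]
rewards, pseudo_label = self_consistency_rewards(rollout_answers)
print(pseudo_label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0]
```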
This is pretty remarkable – AI systems learning to self-improve. We're seeing a wave of research where AI isn't just learning from human feedback; it's starting to figure out how to improve itself using its own internal signals. A subtle but profound shift.
While LLMs contain extensive factual knowledge, they are also unreliable when answering questions downstream. In our #ICML2024 paper (arxiv.org/abs/2406.14785), we study the impact of QA finetuning and identify that the choice of fine-tuning data significantly affects factuality.
The 1st fully AI-generated scientific discovery to pass the highest level of peer review – the main track of an A* conference (ACL 2025). Zochi, the 1st PhD-level agent. Beta open.
Excited to share our work: Maximizing Confidence Alone Improves Reasoning Humans rely on confidence to learn when answer keys aren't available (e.g., taking an exam). Surprisingly, LLMs can also learn w/o ground-truth answers, simply by reinforcing high-confidence answers via RL!
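One natural way to turn "confidence" into an RL reward is to score a rollout by the model's own token probabilities; the sketch below uses the mean log-probability of the generated tokens. The exact confidence measure used in the paper may differ, so everything here (function name, averaging choice) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a confidence-based reward: score a rollout by how
# certain the model was about the tokens it generated (mean log-prob of
# the sampled tokens). No ground-truth answer is consulted anywhere.

def confidence_reward(logits, generated_ids):
    """logits: [T, vocab] next-token logits; generated_ids: [T] sampled token ids."""
    log_probs = F.log_softmax(logits, dim=-1)                  # [T, vocab]
    chosen = log_probs.gather(1, generated_ids.unsqueeze(1))   # [T, 1] log-prob of each sampled token
    return chosen.mean().item()                                # higher = more confident

# Toy example: a 5-token rollout over a 10-word vocabulary.
logits = torch.randn(5, 10)
ids = torch.randint(0, 10, (5,))
print(confidence_reward(logits, ids))
```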