Aradhye Agarwal
@AradhyeAgarwal
Incoming Research Fellow at Microsoft Research, CSE @ IIT Delhi, scaling test-time compute for LLMs
For the past couple of months we've been working on test-time scaling, and we've made a significant discovery:
My two cents on RL in its current form.
I think simply performing RL with only the final answer as the reward signal is problematic, since it might result in the LLM learning to "game" the reward. We would like the model to produce "correct" reasoning traces in addition to the correct answer, but without checking intermediate steps,…
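To make the concern concrete, here is a minimal, hypothetical sketch (not from the thread or the paper) contrasting an outcome-only reward with a process-style reward that also scores intermediate steps; verify_step is an assumed step verifier:

from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    # Reward depends only on the final answer: any trace that lands on the
    # right string scores 1.0, which is what leaves room for reward gaming.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_reward(steps: List[str], final_answer: str, gold_answer: str,
                   verify_step: Callable[[str], bool]) -> float:
    # Also credit intermediate steps, so a correct answer reached through an
    # unsound trace earns less than one reached through valid reasoning.
    step_score = sum(verify_step(s) for s in steps) / max(len(steps), 1)
    return 0.5 * outcome_reward(final_answer, gold_answer) + 0.5 * step_score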
In Prof. Alex's defense, "short is correct" and "correct is short" mean the same thing, which we show analytically in our recent paper. Check out the attached snippet from our paper: arxiv.org/abs/2505.18149
Yup, my student is correcting my Bayesian fallacy here. We care about P(correct | short), not P(short | correct), because at inference time we obviously don't know what is correct. This is also why best-of-K is not something you can do at inference time, if you don't know what the…
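As a quick sketch of why the direction of conditioning matters (only the generic Bayes identity is used here, not any result from the paper):

\[
P(\text{correct} \mid \text{short}) \;=\; \frac{P(\text{short} \mid \text{correct})\, P(\text{correct})}{P(\text{short})}
\]

The two conditionals agree only when the marginal terms cooperate; at inference time "short" is observable while "correct" is not, which is why the left-hand side is the quantity one can actually act on.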