Rohan Pandey
@khoomeik
RLpilled simulationmaxxer || prev research @OpenAI @CarnegieMellon '23
I’ve left OpenAI! Already miss everyone on the Training team & my friends ❤️ but very excited to soon announce what’s next. Until then, I’ll be taking a break to solve OCR for Sanskrit so we can immortalize the classical Indian literary canon in the weights of superintelligence

much more empirical/interpy work needed to understand why RL with CoT is so much better than without (not looking for theoretical explanations like test-time scaling expressivity or latent variable expectation maximization)
pictured: non-reasoning model doing non-reasoning things
guys cmon pika is chill they're smart people they did not deserve to get ratioed

fr tho sorry friends @ pika, didn't mean to send hate your way, but please do consider building civilization-enriching experiences and not "the child-eating short-form video blackhole" :)

i hate when people familiar with the discourse but who've never run an experiment in their life posture about "reading arxiv papers" or "doing research". i think i see a bit of my college freshman self in them

Prime Intellect’s “staff tweeting good stuff” game rivals the peak of OpenAI’s (which coincides with when I was there ofc) + may currently be SOTA
lots of alpha in retweeting old ilya/gdb bangers and baiting the cracked oldheads to come out of the woodwork and drop lore like this
Why I decided to do RL in 2016 after trying out MuProp on training a Neural Programmer and backprop failed me
For differentiable problems, there’s backpropagation. For everything else, there’s RL.
ML is like alchemy, turning silicon into spirits and ghosts
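A minimal sketch of the backprop-vs-RL split in the aphorism above, under toy assumptions (the quadratic loss, the |a − 3| black-box reward, the Gaussian policy, and all learning rates are illustrative, not from any tweet): when the objective is differentiable we follow its gradient directly; when the reward can only be queried, a REINFORCE-style score-function update estimates the gradient from samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Differentiable problem: minimize (theta - 3)^2 with its exact gradient
# (hand-written here; autodiff/backprop would produce the same thing).
theta = 0.0
for _ in range(100):
    grad = 2.0 * (theta - 3.0)        # dL/dtheta
    theta -= 0.1 * grad
print(f"backprop: theta = {theta:.3f}")   # ~3.0

# Non-differentiable problem: reward is a black box we can only query.
def reward(a: float) -> float:
    return -abs(a - 3.0)  # kinked at 3; pretend we cannot differentiate it

# REINFORCE: sample from a Gaussian policy N(mu, 1) and nudge mu along
# reward * grad-log-prob, which for this policy is reward * (a - mu).
mu = 0.0
for _ in range(2000):
    a = rng.normal(mu, 1.0)
    mu += 0.05 * reward(a) * (a - mu)
print(f"RL: mu = {mu:.3f}")               # noisy, but lands near 3.0
```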
RL can train an NN policy to do pretty much anything in sim. With Automatic Domain Randomization, we can bridge the sim2real gap by training the policy to be so adaptable that it can generalize to the physical robot. The result: it solves the Rubik's cube with a robot hand!
We've trained an AI system to solve the Rubik's Cube with a human-like robot hand. This is an unprecedented level of dexterity for a robot, and is hard even for humans to do. The system trains in an imperfect simulation and quickly adapts to reality: openai.com/blog/solving-r…
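A toy sketch of the ADR idea behind the quoted result, not OpenAI's actual implementation (the parameter names, ranges, thresholds, and the train_episode stub are all assumptions): each episode samples physics parameters from per-parameter ranges, and a range is widened only when the policy succeeds with a parameter pinned at that range's boundary, so the distribution of simulated worlds hardens automatically.

```python
import random

# Start with narrow ranges around the simulator's nominal physics.
ranges = {"friction": [0.9, 1.1], "cube_size_scale": [0.98, 1.02]}
EXPAND_STEP = 0.02       # how much to widen a boundary the policy has mastered
SUCCESS_THRESHOLD = 0.8  # success rate required before widening

def train_episode(params: dict) -> float:
    """Stub for one sim rollout + policy update; returns episode success.
    A real implementation would run the physics simulator here."""
    return random.random()

for step in range(1_000):
    # Sample each physics parameter uniformly from its current range.
    params = {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}

    # Pin one parameter to a boundary of its range to probe the policy there.
    probe = random.choice(list(ranges))
    lo, hi = ranges[probe]
    side = random.choice([0, 1])
    params[probe] = (lo, hi)[side]

    # If the policy still succeeds at the boundary, push that boundary out,
    # making the training distribution progressively more diverse.
    if train_episode(params) >= SUCCESS_THRESHOLD:
        ranges[probe][side] += EXPAND_STEP if side else -EXPAND_STEP
```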
huge ideas that promise near-infinite gain in theory often lead to colossal disasters in practice
The Bitter Lesson does not say to not bother with methods research. It says to not bother with methods that are handcrafted datapoints in disguise.
if you think the bitter lesson is invalidated by the “data wall”, please observe that we have compute->data transmogrifiers: they’re called simulators
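In the spirit of the quip, a minimal sketch with an entirely made-up toy simulator (the projectile dynamics and dataset size are illustrative): every simulator call converts compute into a fresh, perfectly labeled training pair, so the dataset scales with FLOPs rather than with scraped text.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_step(state: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """Toy 2D projectile simulator: position/velocity under gravity."""
    pos, vel = state[:2], state[2:]
    return np.concatenate([pos + vel * dt, vel + dt * np.array([0.0, -9.8])])

# Compute -> data: mint as many labeled (state, next_state) pairs as we can afford.
inputs = rng.normal(size=(10_000, 4))                  # random initial states
labels = np.stack([simulate_step(s) for s in inputs])  # ground truth from sim
print(inputs.shape, labels.shape)  # (10000, 4) (10000, 4)
```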