Uzay @ sf
@uzpg_
CS / Math / philosophy | ML research @ MIT | France-US
Hi! Here's some information about who I am, what I'm interested in, and some of my "values" — building on my personal wiki, website and general digital presence.
I plan on doing a big "who am I? what do I like? what am I interested in?" pinned tweet tomorrow, sort of as a tree of tweets instead of just a linked list. Very excited because I can't really express much about myself in my bio, nor have I really worked on it.
reading through a bunch of swebench logs a few months ago made me realize it wouldn't be a good measure of the kind of long horizon coding ability I needed for a research project, which is why we made breakpoint
SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure? We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories. Here’s a summary of our article 🧵
Crowdsourcing solutions to the world's highest-leverage problems this Saturday! Join us hacking on dynamic protein prediction, labor automation modeling, and acoustic window cranioplasties (altho I'm not sure how the last one will work). Links below.
what are the current best systems or ideas for the human input continual learning described here?
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…
we are going to start having self evolving software - user interaction patterns and inputs automatically dictate changes, at least in the frontend, quite soon and it's going to be pretty fun :)
doing frontend with claude code is so fun, favorite use of a model
having this weird issue with claude where it repeatedly refuses to use the code execution tool I give it, preferring to embed its tool inside of its response in this strange way, probably because of some internal RL on tool calling. kinda interesting

"I think nature's imagination is so much greater than man's, she's never going to let us relax" - richard feynman. good thought for an ML researcher in this day and age
Breakpoint got into COLM! Will be in Montreal in October :) Also have been improving the library for ease of use, getting it in production with different research orgs, and making it Inspect compatible
@kaivu, @atticuswzf, and I were researching long horizon reasoning (with @jacobandreas). We found existing benchmarks’ hard problems often featured tricky puzzles, not tests of system understanding. So we made Breakpoint: a SWE benchmark designed to disambiguate this capability.
any actually useful books for advice and frameworks on getting shit done in the world, or inspiring biographies?