Ori Press
@ori_press
Graduate student @BethgeLab. I yearn to deep learn
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️

Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code 🚀 algotune.io 📚 algotune.io/paper.pdf 🤖 github.com/oripress/AlgoT… with @OfirPress @ori_press @PatrickKidger @b_stellato @ArmanZharmagam1 & many others 🧵
AlgoTune is extremely tough, with agents not finding substantial speedups on most tasks. But sometimes these agents do really cool things: here, the agent realized that it could solve this convex optimization problem with a scipy function, leading to an 81x speedup.
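A minimal sketch of the kind of rewrite described above. The 81x speedup is from the thread; the specific problem and scipy routine below are illustrative assumptions, with non-negative least squares standing in as a convex problem that happens to have a dedicated compiled solver in scipy:

```python
# Illustrative only: the thread doesn't say which problem or which scipy
# function the agent found. Non-negative least squares is one example of a
# convex problem with a fast, specialized scipy routine.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 100))
b = rng.standard_normal(500)

# Generic route: model it as a convex program and call a general solver
# (e.g. via cvxpy) -- flexible but slow:
#
#   import cvxpy as cp
#   x = cp.Variable(100, nonneg=True)
#   cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b))).solve()
#
# Specialized route: the same problem is exactly scipy's nnls, which
# dispatches to a compiled active-set solver directly.
x, residual = nnls(A, b)
print(x.shape, residual)
```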
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
Completing games requires long context and complex visual processing, so we put a bunch of 90s games into an emulator and made a benchmark. Our agent can't even beat the first level of these games. You can download it right now and try it out.
Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found that Sonnet 3.7 got the furthest, finding the blue room! Our VideoGameBench (twenty games from the 90s) and agent are open source so you can try it yourself now --> 🧵
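For flavor, here's roughly what such a screen-in, buttons-out agent loop looks like. This is a hypothetical sketch, not VideoGameBench's actual code: `Emulator` and `query_vlm` are stand-ins.

```python
# Hypothetical sketch of a raw-screen-input game agent loop.
# `Emulator` and `query_vlm` are stand-ins, not VideoGameBench's API.
import random

BUTTONS = ["up", "down", "left", "right", "a", "b", "start"]

class Emulator:
    """Stand-in for a Game Boy / MS-DOS emulator wrapper."""
    def __init__(self, rom):
        self.rom, self.frame = rom, 0
    def screenshot(self):
        self.frame += 1
        return f"<frame {self.frame} of {self.rom}>"  # would be raw pixels
    def press(self, button):
        pass                                          # would send the input
    def done(self):
        return self.frame >= 5

def query_vlm(image, prompt):
    """Stand-in for a VLM call; a real agent sends the frame + prompt to a model."""
    return random.choice(BUTTONS)

def play(rom, max_steps=10_000):
    emu = Emulator(rom)
    prompt = f"You see one game frame. Reply with exactly one of: {', '.join(BUTTONS)}."
    while not emu.done() and max_steps:
        action = query_vlm(emu.screenshot(), prompt).strip().lower()
        if action in BUTTONS:          # ignore malformed model replies
            emu.press(action)
        max_steps -= 1

play("doom2.rom")
```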
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.
Wow!🤯🤯
VideoJAM is our new framework for improved motion generation from @AIatMeta. We show that video generators struggle with motion because the training objective favors appearance over dynamics. VideoJAM directly addresses this **without any extra data or scaling** 👇🧵
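Schematically, the fix is to make the model predict motion alongside appearance from one shared representation. A rough sketch of such a joint objective, assuming an MSE pixel term plus an optical-flow term; shapes, the flow target, and the weighting are assumptions, not VideoJAM's actual implementation:

```python
# Schematic joint appearance+motion objective: the second loss term is
# what forces the representation to also encode dynamics, rather than
# favoring appearance alone. Not VideoJAM's actual code.
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """One shared representation, two heads: appearance and motion."""
    def __init__(self, dim=256, out=3):
        super().__init__()
        self.appearance = nn.Linear(dim, out)  # predicts pixel/latent values
        self.motion = nn.Linear(dim, 2)        # predicts a 2-channel flow field

    def forward(self, h):
        return self.appearance(h), self.motion(h)

def joint_loss(head, h, target_pixels, target_flow, lam=1.0):
    pred_pix, pred_flow = head(h)
    # Appearance-only training would stop at the first term.
    return nn.functional.mse_loss(pred_pix, target_pixels) + \
           lam * nn.functional.mse_loss(pred_flow, target_flow)

head = JointHead()
h = torch.randn(8, 256)                       # features from a video backbone
loss = joint_loss(head, h, torch.randn(8, 3), torch.randn(8, 2))
loss.backward()
```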
SWE-bench Multimodal evaluation code is out now! SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).
Today we're presenting our work on a task that LLMs still cannot do as well as humans. Stop by our poster to find out more!
We are presenting CiteMe today at the 11AM poster session (East Exhibit Hall A-C, #3309) #NeurIPS2024
We're presenting SWE-agent tomorrow (Wed) at the 11AM poster session, East Exhibit Hall A-C #1000. We're going to talk about a lot of upcoming SWE-agent features. Join @jyangballin @_carlosejimenez @KLieret and me. I also have a bunch of SWE-agent stickers to hand out :)
I'm on the academic job market! I develop autonomous systems for: programming, research-level question answering, finding sec vulnerabilities & other useful+challenging tasks. I do this by building frontier-pushing benchmarks and agents that do well on them. See you at NeurIPS!
🚀New Paper: Active Curation Effectively Distills Multimodal Models arxiv.org/abs/2411.18674 Smol models are all the rage & knowledge distillation (KD) is key for model compression! We show how data curation can effectively distill to yield SoTA FLOP-efficient {C/Sig}LIPs!! 🧵👇
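One plausible instantiation of curation-as-distillation (learnability-based selection in the style of this line of work); the scoring rule and hyperparameters below are illustrative assumptions, not the paper's exact recipe:

```python
# Keep examples the student still gets wrong but a strong reference model
# finds easy: training on them nudges the student toward the reference,
# an implicit distillation signal. Illustrative sketch, not the paper's code.
import numpy as np

def learnability_scores(student_loss, reference_loss):
    # High when the student struggles AND the reference model does not.
    return student_loss - reference_loss

def curate_batch(student_loss, reference_loss, keep_frac=0.2):
    scores = learnability_scores(student_loss, reference_loss)
    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(scores)[-k:]            # indices of the top-k examples

rng = np.random.default_rng(0)
student = rng.uniform(0, 5, size=1024)        # per-example student losses
reference = rng.uniform(0, 5, size=1024)      # per-example reference losses
batch_idx = curate_batch(student, reference)
print(len(batch_idx), "examples kept for this step")
```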