Ori Press
@ori_press
Graduate student @BethgeLab. I yearn to deep learn
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️

Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code 🚀 algotune.io 📚 algotune.io/paper.pdf 🤖 github.com/oripress/AlgoT… with @OfirPress @ori_press @PatrickKidger @b_stellato @ArmanZharmagam1 & many others 🧵
AlgoTune is extremely tough, with agents not finding substantial speedups on most tasks. But sometimes these agents do really cool things: here, the agent realized that it could solve this convex optimization problem with a scipy function, leading to an 81x speedup.
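A minimal sketch of the kind of rewrite described above. The 81x speedup is from the thread; the specific problem and scipy routine below are illustrative assumptions, with non-negative least squares standing in as a convex problem that happens to have a dedicated compiled solver in scipy:

```python
# Illustrative only: the thread doesn't say which problem or which scipy
# function the agent found. Non-negative least squares is one example of a
# convex problem with a fast, specialized scipy routine.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 100))
b = rng.standard_normal(500)

# Generic route: model it as a convex program and call a general solver
# (e.g. via cvxpy) -- flexible but slow:
#
#   import cvxpy as cp
#   x = cp.Variable(100, nonneg=True)
#   cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b))).solve()
#
# Specialized route: the same problem is exactly scipy's nnls, which
# dispatches to a compiled active-set solver directly.
x, residual = nnls(A, b)
print(x.shape, residual)
```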
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
Completing games requires long context and complex visual processing, so we put a bunch of 90s games into an emulator and made a benchmark. Our agent can't even beat the first level of these games. You can download it right now and try it out.
Claude can play Pokemon, but can it play DOOM? With a simple agent, we let VLMs play it, and found that Sonnet 3.7 got the furthest, finding the blue room! Our VideoGameBench (twenty games from the 90s) and agent are open source so you can try it yourself now --> 🧵
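For flavor, here's roughly what such a screen-in, buttons-out agent loop looks like. This is a hypothetical sketch, not VideoGameBench's actual code: `Emulator` and `query_vlm` are stand-ins.

```python
# Hypothetical sketch of a raw-screen-input game agent loop.
# `Emulator` and `query_vlm` are stand-ins, not VideoGameBench's API.
import random

BUTTONS = ["up", "down", "left", "right", "a", "b", "start"]

class Emulator:
    """Stand-in for a Game Boy / MS-DOS emulator wrapper."""
    def __init__(self, rom):
        self.rom, self.frame = rom, 0
    def screenshot(self):
        self.frame += 1
        return f"<frame {self.frame} of {self.rom}>"  # would be raw pixels
    def press(self, button):
        pass                                          # would send the input
    def done(self):
        return self.frame >= 5

def query_vlm(image, prompt):
    """Stand-in for a VLM call; a real agent sends the frame + prompt to a model."""
    return random.choice(BUTTONS)

def play(rom, max_steps=10_000):
    emu = Emulator(rom)
    prompt = f"You see one game frame. Reply with exactly one of: {', '.join(BUTTONS)}."
    while not emu.done() and max_steps:
        action = query_vlm(emu.screenshot(), prompt).strip().lower()
        if action in BUTTONS:          # ignore malformed model replies
            emu.press(action)
        max_steps -= 1

play("doom2.rom")
```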
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.
Wow!🤯🤯
VideoJAM is our new framework for improved motion generation from @AIatMeta. We show that video generators struggle with motion because the training objective favors appearance over dynamics. VideoJAM directly addresses this **without any extra data or scaling** 👇🧵
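Schematically, the fix is to make the model predict motion alongside appearance from one shared representation. A rough sketch of such a joint objective, assuming an MSE pixel term plus an optical-flow term; shapes, the flow target, and the weighting are assumptions, not VideoJAM's actual implementation:

```python
# Schematic joint appearance+motion objective: the second loss term is
# what forces the representation to also encode dynamics, rather than
# favoring appearance alone. Not VideoJAM's actual code.
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """One shared representation, two heads: appearance and motion."""
    def __init__(self, dim=256, out=3):
        super().__init__()
        self.appearance = nn.Linear(dim, out)  # predicts pixel/latent values
        self.motion = nn.Linear(dim, 2)        # predicts a 2-channel flow field

    def forward(self, h):
        return self.appearance(h), self.motion(h)

def joint_loss(head, h, target_pixels, target_flow, lam=1.0):
    pred_pix, pred_flow = head(h)
    # Appearance-only training would stop at the first term.
    return nn.functional.mse_loss(pred_pix, target_pixels) + \
           lam * nn.functional.mse_loss(pred_flow, target_flow)

head = JointHead()
h = torch.randn(8, 256)                       # features from a video backbone
loss = joint_loss(head, h, torch.randn(8, 3), torch.randn(8, 2))
loss.backward()
```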
SWE-bench Multimodal evaluation code is out now! SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).
Today we're presenting our work on a task that LLMs still cannot do as well as humans. Stop by our poster to find out more!
We are presenting CiteMe today at the 11AM poster session (East Exhibit Hall A-C, #3309) #NeurIPS2024
We're presenting SWE-agent tomorrow (Wed) at the 11AM poster session, East Exhibit Hall A-C #1000. We're going to talk about a lot of upcoming SWE-agent features. Join @jyangballin @_carlosejimenez @KLieret and me. I also have a bunch of SWE-agent stickers to hand out :)
I'm on the academic job market! I develop autonomous systems for: programming, research-level question answering, finding sec vulnerabilities & other useful+challenging tasks. I do this by building frontier-pushing benchmarks and agents that do well on them. See you at NeurIPS!
🚀New Paper: Active Curation Effectively Distills Multimodal Models arxiv.org/abs/2411.18674 Smol models are all the rage & knowledge distillation (KD) is key for model compression! We show how data curation can effectively distill to yield SoTA FLOP-efficient {C/Sig}LIPs!! 🧵👇
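One plausible instantiation of curation-as-distillation (learnability-based selection in the style of this line of work); the scoring rule and hyperparameters below are illustrative assumptions, not the paper's exact recipe:

```python
# Keep examples the student still gets wrong but a strong reference model
# finds easy: training on them nudges the student toward the reference,
# an implicit distillation signal. Illustrative sketch, not the paper's code.
import numpy as np

def learnability_scores(student_loss, reference_loss):
    # High when the student struggles AND the reference model does not.
    return student_loss - reference_loss

def curate_batch(student_loss, reference_loss, keep_frac=0.2):
    scores = learnability_scores(student_loss, reference_loss)
    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(scores)[-k:]            # indices of the top-k examples

rng = np.random.default_rng(0)
student = rng.uniform(0, 5, size=1024)        # per-example student losses
reference = rng.uniform(0, 5, size=1024)      # per-example reference losses
batch_idx = curate_batch(student, reference)
print(len(batch_idx), "examples kept for this step")
```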