Karl Pertsch
@KarlPertsch
Robot Foundation Models @ UC Berkeley & Stanford & @physical_int | Postdoc w/ Sergey Levine & Chelsea Finn | Prev. Intern @ Google Brain, Meta AI | PhD @ USC.
We’re releasing RoboArena today!🤖🦾 Fair & scalable evaluation is a major bottleneck for research on generalist policies. We’re hoping that RoboArena can help! We provide data, model code & sim evals for debugging! Submit your policies today and join the leaderboard! :) 🧵
It was time to improve our evaluations in robot learning! We introduce a methodology based on anonymous A/B testing: fairer, stronger, community-driven. Awesome work by @KarlPertsch @pranav_atreya @tonyh_lee and an incredible crowdsourcing team. Upload and test your model! 🚀
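For context on how anonymous A/B comparisons can become a leaderboard: the sketch below fits a simple Bradley-Terry model to pairwise win counts, the generic aggregation idea behind arena-style rankings. The policy names and numbers are made up, and this is an illustrative assumption, not necessarily RoboArena's exact aggregation method.

```python
# Minimal sketch: turn pairwise A/B outcomes into a ranking via a
# Bradley-Terry fit (MM / Zermelo iterations). Names and counts are
# hypothetical; RoboArena's actual aggregation may differ.
import numpy as np

policies = ["policy_a", "policy_b", "policy_c"]  # hypothetical entries
# wins[i, j] = number of head-to-head comparisons policy i won against policy j
wins = np.array([
    [0, 14, 18],
    [9,  0, 12],
    [7, 10,  0],
], dtype=float)

n = wins + wins.T            # total comparisons between each pair
total_wins = wins.sum(axis=1)
p = np.ones(len(policies))   # Bradley-Terry "skill" parameters

for _ in range(200):
    # p_i <- W_i / sum_j n_ij / (p_i + p_j), the standard MM update
    denom = (n / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
    p = total_wins / np.maximum(denom, 1e-12)
    p /= p.sum()             # normalize; only relative scale is identifiable

for name, score in sorted(zip(policies, p), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```

Pairwise preferences like these are the same signal Chatbot Arena-style leaderboards aggregate; the appeal for robotics is that evaluators only need to judge which of two anonymous policies did better on a task, rather than agree on absolute scores.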
Check out some of our new work on distributed robot evaluation led by @KarlPertsch, @pranav_atreya and @tonyh_lee! Hopefully folks can contribute and help us take a step towards systematic and standardized empiricism in robot learning! :) Also check out some of the fun sim eval…
🚀 We just launched RoboArena — a real-world evaluation platform for robot policies! Think Chatbot Arena, but for robotics. 📝 Paper: robo-arena.github.io/assets/roboare… 🌐 Website: robo-arena.github.io Joint work with @pranav_atreya and @KarlPertsch, advised by @percyliang,…
Final note: It is easier to work on robotics now than at any point in the past.
I'll give a talk about benchmarking generalist policies today at RSS (4:30p, RTH 526, in the benchmarking workshop)! I will discuss sim eval, auto eval, and distributed real-world eval (i.e., RoboArena) -- swing by :)

In robotics, benchmarks are rarely shared. New eval setups are created for each new project, a stark difference from evals in broader ML. But generalist policies share a problem statement: do any task in any environment. Can generalist capabilities make robot evaluation easier?