XLANG NLP Lab (@XLangNLP)

😀 Our initial leaderboard is finally out! Here I'd like to share a few interesting findings from our case study: 1. Claude 3.7 Sonnet consistently performs best across diverse task types, particularly excelling at open-ended queries like “write a paper reading report.” 2.…

XLANG NLP Lab @XLangNLP · May 2

🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from @AnthropicAI ties #1 in Computer Agent Arena, followed by Operator from @OpenAI & UI-TARS-1.5 from @BytedanceTalk, which is significantly different from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard

XLANG NLP Lab @XLangNLP · May 2

🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from @AnthropicAI ties #1 in Computer Agent Arena, followed by Operator from @OpenAI & UI-TARS-1.5 from @BytedanceTalk, which is significantly different from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard

XLANG NLP Lab @XLangNLP · Apr 24

🎉 UI-TARS-1.5 is now live on Computer Agent Arena! Currently the SOTA model across multiple GUI benchmarks, showcasing leading performance in computer use, browser use, and even gameplay. Want to try the most intelligent CUA so far? Go to arena.xlang.ai.

Yujia Qin @TsingYoga · Apr 17

Introducing UI-TARS-1.5, a vision-language model that beats OpenAI Operator and Claude 3.7 on GUI Agent and Game Agent tasks. We've open-sourced a small-size version model for research purposes, more details can be found in our blog. TARS learns solely from a screen, but…

XLANG NLP Lab @XLangNLP · Apr 24

🚀 Exciting news! @OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at arena.xlang.ai! Join the community and dive in!

OpenAI @OpenAI · Apr 16

Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.

XLANG NLP Lab @XLangNLP · Apr 8

Check out Computer Use Agent Arena, an exciting new launch by OSWorld team @XLangNLP!

Bowen Wang @BowenWangNLP · Apr 8

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 VL and more 🕹️ Test agents on 100+ real apps & websites with one-click config 🔒 Safe & free…

XLANG NLP Lab @XLangNLP · Apr 8

Finally we are here! 👏 Check out our most open & fair benchmark⚔️ for computer use capability evaluation for the community.

XLANG NLP Lab @XLangNLP · Apr 8

👉 Compare and test Computer Use Agents (Operator, Claude 3.7...) on any kind of task on real computers 🚩without any setup or cost🚩! Try our Computer Use Agent Arena: arena.xlang.ai

XLANG NLP Lab Retweeted

Bowen Wang @BowenWangNLP · Apr 8

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 VL and more 🕹️ Test agents on 100+ real apps & websites with one-click config 🔒 Safe & free…

XLANG NLP Lab @XLangNLP · Apr 8

🚀 After a year of development based on our OSWorld, Computer Use Agent Arena is LIVE! Test top AI agents (Operator, Claude 3.7...) on any kind of computer-use task with zero setup. Cloud-hosted, safe, and FREE! Try it now: arena.xlang.ai! Data & code coming soon!
