Bowen Wang
@BowenWangNLP
1st year Ph.D. student @XLangNLP @HKUniversity focusing on #NLP. Prev. @Tsinghua_Uni, passionate about computer-use agents.
🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 VL and more 🕹️ Test agents on 100+ real apps & websites with one-click config 🔒 Safe & free…
Big congrats to @TsingYoga and their team for pushing the boundaries of CUAs! Having developed with it, UI-TARS-1.5 truly feels like the beginning of a new chapter, and the next episode is coming. Stay tuned for the leaderboard 🚀!
🎉 UI-TARS-1.5 is now live on Computer Agent Arena! Currently the SOTA model across multiple GUI benchmarks, showcasing leading performance in computer use, browser use, and even gameplay. Want to try the most intelligent CUA so far? Go to arena.xlang.ai.
Meet Kimi-Researcher, an autonomous agent that excels at multi-turn search and reasoning. Powered by k 1.5 and trained with end-to-end agentic RL, it achieved 26.9% pass@1 on Humanity's Last Exam and 69% pass@1 on xbench. 🔗 Tech blog: moonshotai.github.io/Kimi-Researche…
🔥 New Computer Agent Arena Leaderboard Updates (2k+ user votes)! 🤔 Which VLMs act better as computer use agents (CUAs)? 1. Claude Sonnet 4 🥇 2. Claude 3.7 Sonnet 🥈 3. UI-TARS-1.5 🥉 4. Operator. More insights in the thread 👇 arena.xlang.ai
Operator 🤝 OpenAI o3 Operator in ChatGPT has been updated with our latest reasoning model. operator.chatgpt.com
Based on my own testing, Claude 4 is even stronger at computer use than Claude 3.7 Sonnet, with enhanced agentic capabilities. Come and give it a try!
💠 Claude Opus 4 & Claude Sonnet 4, welcome to the Computer Agent Arena 🔥 Congratulations to the @AnthropicAI team on the great release!
🤔 Static CUA benchmarks enable fast model dev but lack task variety and risk overfitting. Computer Agent Arena tests crowdsourced real-world tasks. OSWorld: 🥇 UI-TARS-1.5 🥈 Operator 🥉 Claude 3.7. CUA Arena: 🥇 Claude 3.7 🥈 Operator 🥉 UI-TARS-1.5. 🚀 Rankings likely to evolve quickly
🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from @AnthropicAI ties for #1 in Computer Agent Arena, followed by Operator from @OpenAI & UI-TARS-1.5 from @BytedanceTalk. This ranking differs significantly from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard
😀 Our initial leaderboard is finally out. Here I'd like to share a few interesting findings from our case study: 1, Claude 3.7 Sonnet consistently performs best across diverse task types, particularly excelling at open-ended queries like “write a paper reading report.” 2,…
Part 2 of Build Your Own Operator on macOS is now live! The new cua-agent framework cuts down complexity and accelerates CUA development - so you can focus on building, not boilerplate.
For folks working on CUAs, definitely give o3 and o4-mini from @OpenAI a try. Key takeaway: enhancing the image reasoning and tool-use abilities of foundation models (FMs) could significantly boost CUA performance.
🚀 Exciting news! @OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at arena.xlang.ai! Join the community and dive in!
🚀 Meet Kimi-VL and Kimi-VL-Thinking! 🌟 Our latest open source lightweight yet powerful Vision-Language Model with reasoning capability. ✨ Key Highlights: 💡 An MoE VLM and an MoE Reasoning VLM with only ~3B activated parameters 🧠 Strong multimodal reasoning (36.8% on…
I want to highlight that this was an incredibly complex piece of work put together by @BowenWangNLP. We have been working on this for more than a year - much longer than a typical 3-5 month AI sprint. Big thanks to @taoyds for leading such an impactful project.