Zeyi Liao (@LiaoZeyi)

Pinned

Zeyi Liao@LiaoZeyi · May 30

⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet, @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…


Zeyi Liao Retweeted

Jianyang Gu@vimar_gu · Jul 23

Announcing the @NeurIPSConf 2025 workshop on Imageomics: Discovering Biological Knowledge from Images Using AI! The workshop focuses on the interdisciplinary field between machine learning and biological science. We look forward to seeing you in San Diego! #NeurIPS2025


Zeyi Liao@LiaoZeyi · Jul 17

Appreciate the transparency - highlighting agent risks is essential. Our project, RedTeamCUA, uses a hybrid sandbox to test how bad actors can trick computer-use agents to perform harmful actions, safely highlighting realistic risks before deployment! x.com/LiaoZeyi/statu…



Zeyi Liao Retweeted

Huan Sun (OSU)@hhsun1 · Jul 15

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…


Zeyi Liao Retweeted

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…


Zeyi Liao Retweeted

Yifei Li@YifeiLiPKU · Jun 12

📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)


Zeyi Liao Retweeted

Ekdeep Singh@EkdeepL · Jun 6

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11


Zeyi Liao Retweeted

Botao Yu@BotaoYu24 · Jun 6

🔬 Introducing ChemMCP, the first MCP-compatible toolkit for empowering AI models with advanced chemistry capabilities! In recent years, we’ve seen rising interest in tool-using AI agents across domains. Particularly in scientific domains like chemistry, LLMs alone still fall…


Zeyi Liao@LiaoZeyi · Jun 4

I believe computer use, in principle, is much harder than math/coding for current AI. The digital world encompasses a much larger part of the complexity in this world. The goals are often vastly underspecified and require accessing and understanding broad context (in users’ head…

Dwarkesh Patel@dwarkesh_sp · Jun 2

New blog post where I explain why I disagree with this, and why I have slightly longer timelines to AGI than many of my guests. I think continual learning is a huge bottleneck to the usefulness of these models, and extended computer use may take years to sort out. Link below.


Zeyi Liao Retweeted

Rui Qiu@RuiQiu18 · Jun 3

Systematic reviews (SRs) drive evidence-based medicine, but months-long workflows can’t keep pace with today’s literature flood. Fully autonomous solutions promise speed, but the magic often fizzles - these models still skip pivotal trials, hallucinate findings, and bury the…


Zeyi Liao@LiaoZeyi · May 30

It is great to see new work like this focus on building a sandbox environment to test the safety of autonomous agents. This type of work can unlock a lot of use cases and assist the testing of different threat models.



Zeyi Liao@LiaoZeyi · May 30

Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is @AnthropicAI Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?…



Zeyi Liao@LiaoZeyi · May 30

So thrilled to share "RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments"! With our novel hybrid testing sandbox, systematic analysis using our RTC-Bench benchmark shows even the new Claude 4 Opus hits a 48% Attack Success Rate!🤯…

