Kai Zhang
@DrogoKhal4
PhD-ing @osunlp with @ysu_nlp
🚀Big WebDreamer update! We train 💭Dreamer-7B, a small but strong world model for real-world web planning. 💥Beats Qwen2-72B ⚖️Matches #GPT-4o. Trained on 3M synthetic examples — and yes, all data + models are open-sourced.
❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋‍♂️Short answer: Using your LLM as a world model 💡More detailed answer: Using GPT-4o to predict the outcome of actions on a website can deliver strong performance with improved safety and…
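To make the idea concrete, here is a minimal sketch of LLM-as-world-model planning, assuming an OpenAI-style chat API; the prompts, free-text state descriptions, and 0-10 progress scoring are illustrative, not WebDreamer's exact setup:

```python
# Minimal sketch of model-based planning with an LLM as the world model.
# Assumptions (not from the paper): OpenAI chat API, free-text webpage
# descriptions, and a 0-10 progress score parsed from the critic's reply.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def simulate(state: str, action: str) -> str:
    """Ask the LLM (world model) to imagine the page after taking `action`."""
    return llm(
        f"Current webpage:\n{state}\n\n"
        f"Describe the webpage that would result from this action: {action}"
    )

def score(goal: str, imagined_state: str) -> float:
    """Ask the LLM to rate progress toward the goal on the imagined page."""
    reply = llm(
        f"Task: {goal}\nImagined webpage:\n{imagined_state}\n"
        "Rate progress toward the task from 0 to 10. Reply with a number only."
    )
    try:
        return float(reply.strip().split()[0])
    except ValueError:
        return 0.0

def pick_action(goal: str, state: str, candidates: list[str]) -> str:
    """Simulate each candidate action in the LLM and pick the most promising."""
    return max(candidates, key=lambda a: score(goal, simulate(state, a)))
```

Because the rollout is imagined rather than executed, risky candidate actions can be scored and rejected before they ever touch the live site, which is where the safety benefit comes from.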
Attending #CVPR2025 in #Nashville! Our multimodal LLM evaluation tutorial is tomorrow afternoon! Feel free to ping me any time if you'd like to chat about multimodal models, reasoning, evaluation, etc. DMs are open!
🗒️Have been exploring Agent-RL training over the past few months, particularly in GUI scenarios. Here’s a summary of some practical insights and lessons 🤔 learned from the perspective of an industry researcher, and some reference papers.
ScienceAgentBench from OSU examines the ability of agents to do data processing, model development, visualization, etc.: arxiv.org/abs/2410.05080
MLEBench from OpenAI examines whether models can implement ML experiments: openai.com/index/mle-benc…
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
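A toy sketch of that self-play loop, using tic-tac-toe as a stand-in for the games (the real setup trains an LLM policy with CoT, not the random policy below):

```python
# Toy illustration (not the thread's code): self-play on a game with a
# verifiable reward. Two copies of the same policy play tic-tac-toe, and
# finished games yield reward-labeled trajectories an RL trainer could use.
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def policy(board, player):
    # Stand-in for an LLM policy: pick a random legal move.
    return random.choice([i for i, v in enumerate(board) if v is None])

def self_play_episode():
    board, trajectory, player = [None] * 9, [], "X"
    while winner(board) is None and None in board:
        move = policy(board, player)
        trajectory.append((tuple(board), player, move))
        board[move] = player
        player = "O" if player == "X" else "X"
    w = winner(board)  # cheap, automatically verifiable outcome (None = draw)
    # Label each move +1/-1/0 from the acting player's perspective.
    return [(state, move, 0 if w is None else (1 if p == w else -1))
            for state, p, move in trajectory]

# Collect a batch of verifiably-rewarded steps for RL.
batch = [step for _ in range(100) for step in self_play_episode()]
```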
We're already using AI search systems every day for more and more complex tasks, but how good are they really? Challenge: evaluation is hard with no fixed ground truth! In Mind2Web 2, we use agents to evaluate agents. Really excited! Thanks to everyone who made this possible!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
🧐Agentic search is revolutionizing how we gather information, but how reliable is it? Can it really deliver accurate answers with proper source attribution? 🚀Super excited to share our new work, Mind2Web 2, a rigorous agentic search benchmark with 130 realistic and…
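A minimal sketch of the Agents-as-a-Judge idea: score a free-form answer against a task-specific rubric instead of a fixed ground truth. The rubric items and judge prompt below are illustrative, not Mind2Web 2's actual ones:

```python
# Sketch: an LLM judge checks an agentic-search answer against rubric items,
# since open-ended web tasks have no single fixed ground-truth string.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, rubric_item: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Task: {question}\nAnswer:\n{answer}\n\n"
                f"Does the answer satisfy this criterion: {rubric_item}?\n"
                "Reply YES or NO."
            ),
        }],
    ).choices[0].message.content
    return reply.strip().upper().startswith("YES")

def rubric_score(question: str, answer: str, rubric: list[str]) -> float:
    """Fraction of rubric items the judge deems satisfied."""
    return sum(judge(question, answer, r) for r in rubric) / len(rubric)

# Hypothetical rubric for an attribution-sensitive search task.
rubric = [
    "Every factual claim cites a source URL",
    "The cited pages actually support the claims",
    "All sub-questions of the task are answered",
]
```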
Impressed by V-JEPA 2's improvement on TemporalBench (temporalbench.github.io). Indeed, we need better video encoders for temporal-heavy tasks!
Introducing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Download V-JEPA 2 and read our research paper…
📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)
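A rough sketch of what such a collection pipeline can look like; every function below is a hypothetical stand-in, not AutoSDT's actual stages:

```python
# Sketch: filter candidate scientific programs with an LLM, then synthesize
# a task instruction for each kept program to form (instruction, code) pairs.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def is_data_driven(code: str) -> bool:
    """Filter: keep only programs that load data and produce an analysis result."""
    return ask(
        f"{code}\n\nDoes this program load a dataset and produce a "
        "scientific analysis result? Reply YES or NO."
    ).strip().upper().startswith("YES")

def to_task(code: str) -> dict:
    """Turn a kept program into an (instruction, reference-code) training pair."""
    instruction = ask(
        f"{code}\n\nWrite the task instruction a scientist would give "
        "to produce this program."
    )
    return {"instruction": instruction, "code": code}

def build_dataset(candidate_programs: list[str]) -> list[dict]:
    return [to_task(c) for c in candidate_programs if is_data_driven(c)]
```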
Are we heading down the right path towards omni-modality? 🤔 This new paper explores the effects of extending modality in language models.
Thanks for sharing our work :)
📢 Introducing VisCoder – fine-tuned language models for Python-based visualization code generation and feedback-driven self-debugging. Existing LLMs struggle to generate reliable plotting code: outputs often raise exceptions, produce blank visuals, or fail to reflect the…
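A minimal sketch of the feedback-driven self-debugging loop: execute the generated plotting code, and if it raises, feed the traceback back for a revision. The prompts and the gpt-4o stand-in are illustrative; VisCoder's actual setup may differ:

```python
# Sketch: generate plotting code, run it, and repair it from its own traceback.
import traceback
from openai import OpenAI

client = OpenAI()

def generate(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",  # stand-in; a VisCoder checkpoint would slot in here
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def run(code: str) -> str | None:
    """Execute the plotting code; return a traceback on failure, None on success."""
    try:
        exec(code, {"__name__": "__main__"})
        return None
    except Exception:
        return traceback.format_exc()

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = generate(f"Write Python matplotlib code to: {task}. Code only.")
    for _ in range(max_rounds):
        error = run(code)
        if error is None:
            return code  # executed cleanly
        code = generate(
            f"This code failed:\n{code}\n\nTraceback:\n{error}\n"
            "Return the fixed code only."
        )
    return code
```

Execution feedback is what distinguishes this from single-shot generation: the model sees the concrete exception (or a blank figure) rather than guessing what went wrong.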
Had a blast working with @DarthZhu_! We analyze modality-specific models extended from the same #LLM backbone (e.g., Qwen2-VL, -Video, -Audio on #Qwen2) and try to combine them into omni ones. Though most results are negative, we have some interesting findings here :)
😴 Extending modality based on an LLM has been a common practice when we are talking about multimodal LLMs. ❓ Can it generalize to omni-modality? We study the effects of extending modality and ask three questions: arxiv.org/abs/2506.01872 #LLM #MLLM #OmniModality
Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cut 75% of the STEM budget in the US. Sorry for the long post, but the issue is really important, and I want to share what I know about it. The entire…
Thrilled to announce that I will be joining @UTAustin @UTCompSci as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is @AnthropicAI Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?…
⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet, @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…
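For concreteness, here is how an Attack Success Rate of this kind is typically computed over injection trials; `run_agent` and the `Trial` fields are hypothetical stand-ins for RedTeamCUA's sandboxed rollouts:

```python
# Sketch: measure Attack Success Rate (ASR) for a computer-use agent under
# indirect prompt injection. The attack succeeds when the agent executes the
# instruction planted in page content instead of only the user's benign task.
from dataclasses import dataclass

@dataclass
class Trial:
    benign_task: str       # what the user actually asked the agent to do
    injected_page: str     # page content with an adversarial instruction planted
    forbidden_action: str  # the action the injection tries to trigger

def run_agent(task: str, page: str) -> list[str]:
    """Stand-in: roll out the agent in a sandboxed environment and log its
    executed actions. Replace with a real interactive sandbox."""
    return []

def attack_success_rate(trials: list[Trial]) -> float:
    hits = 0
    for t in trials:
        actions = run_agent(t.benign_task, t.injected_page)
        # Count the trial as a hit if any executed action matches the injection.
        hits += any(t.forbidden_action in a for a in actions)
    return hits / len(trials)
```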
Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long…
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below 🧵👇
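The sanity check this points to boils down to re-running the pre-RL checkpoint yourself before trusting a paper's baseline; a toy illustration with hypothetical numbers:

```python
# Sketch: compare your own measurement of the pre-RL model against the
# baseline a paper reports. All numbers and data below are made up.
def exact_match(preds: list[str], gold: list[str]) -> float:
    """Accuracy is sensitive to the prompt template, few-shot format, and
    answer extraction, which is exactly where discrepancies tend to creep in."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

reported_pre_rl = 0.265            # hypothetical baseline copied from a paper
preds = ["A", "C", "B", "B"]       # your own re-run of the same checkpoint
gold  = ["A", "C", "B", "D"]
measured = exact_match(preds, gold)
if measured > reported_pre_rl + 0.05:   # tolerance for eval noise
    print(f"Pre-RL baseline likely underreported: "
          f"{measured:.1%} measured vs {reported_pre_rl:.1%} reported")
```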