Kai Zhang
@DrogoKhal4
PhD-ing @osunlp with @ysu_nlp
🚀Big WebDreamer update! We train 💭Dreamer-7B, a small but strong world model for real-world web planning. 💥Beats Qwen2-72B ⚖️Matches #GPT-4o. Trained on 3M synthetic examples — and yes, all data + models are open-sourced.
❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋‍♂️Short answer: Using your LLM as a world model 💡More detailed answer: Using GPT-4o to predict the outcome of actions on a website can deliver strong performance with improved safety and…
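To make the idea concrete, here is a minimal sketch of LLM-as-world-model planning, assuming an OpenAI-style chat API; the prompts, free-text state descriptions, and 0-10 progress scoring are illustrative, not WebDreamer's exact setup:

```python
# Minimal sketch of model-based planning with an LLM as the world model.
# Assumptions (not from the paper): OpenAI chat API, free-text webpage
# descriptions, and a 0-10 progress score parsed from the critic's reply.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def simulate(state: str, action: str) -> str:
    """Ask the LLM (world model) to imagine the page after taking `action`."""
    return llm(
        f"Current webpage:\n{state}\n\n"
        f"Describe the webpage that would result from this action: {action}"
    )

def score(goal: str, imagined_state: str) -> float:
    """Ask the LLM to rate progress toward the goal on the imagined page."""
    reply = llm(
        f"Task: {goal}\nImagined webpage:\n{imagined_state}\n"
        "Rate progress toward the task from 0 to 10. Reply with a number only."
    )
    try:
        return float(reply.strip().split()[0])
    except ValueError:
        return 0.0

def pick_action(goal: str, state: str, candidates: list[str]) -> str:
    """Simulate each candidate action in the LLM and pick the most promising."""
    return max(candidates, key=lambda a: score(goal, simulate(state, a)))
```

Because the rollout is imagined rather than executed, risky candidate actions can be scored and rejected before they ever touch the live site, which is where the safety benefit comes from.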
Attending #CVPR2025 in #Nashville! Our multimodal LLM evaluation tutorial is tomorrow afternoon! Feel free to ping me any time if you'd like to chat about multimodal models, reasoning, evaluation, etc. DMs are open!
🗒️Have been exploring Agent-RL training over the past few months, particularly in GUI scenarios. Here’s a summary of some practical insights and lessons 🤔 learned from the perspective of an industry researcher, and some reference papers.
ScienceAgentBench from OSU examines the ability of agents to do data processing, model development, visualization, etc.: arxiv.org/abs/2410.05080
MLEBench from OpenAI examines whether models can implement ML experiments: openai.com/index/mle-benc…
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces…
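A toy sketch of that self-play loop, using tic-tac-toe as a stand-in for the games (the real setup trains an LLM policy with CoT, not the random policy below):

```python
# Toy illustration (not the thread's code): self-play on a game with a
# verifiable reward. Two copies of the same policy play tic-tac-toe, and
# finished games yield reward-labeled trajectories an RL trainer could use.
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def policy(board, player):
    # Stand-in for an LLM policy: pick a random legal move.
    return random.choice([i for i, v in enumerate(board) if v is None])

def self_play_episode():
    board, trajectory, player = [None] * 9, [], "X"
    while winner(board) is None and None in board:
        move = policy(board, player)
        trajectory.append((tuple(board), player, move))
        board[move] = player
        player = "O" if player == "X" else "X"
    w = winner(board)  # cheap, automatically verifiable outcome (None = draw)
    # Label each move +1/-1/0 from the acting player's perspective.
    return [(state, move, 0 if w is None else (1 if p == w else -1))
            for state, p, move in trajectory]

# Collect a batch of verifiably-rewarded steps for RL.
batch = [step for _ in range(100) for step in self_play_episode()]
```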
We're already using AI search systems every day for more and more complex tasks, but how good are they really? Challenge: evaluation is hard with no fixed ground truth! In Mind2Web 2, we use agents to evaluate agents. Really excited! Thanks to everyone who made this possible!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
🧐Agentic search is revolutionizing how we gather information, but how reliable is it? Can it really deliver accurate answers with proper source attribution? 🚀Super excited to share our new work, Mind2Web 2, a rigorous agentic search benchmark with 130 realistic and…
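A minimal sketch of the Agents-as-a-Judge idea: score a free-form answer against a task-specific rubric instead of a fixed ground truth. The rubric items and judge prompt below are illustrative, not Mind2Web 2's actual ones:

```python
# Sketch: an LLM judge checks an agentic-search answer against rubric items,
# since open-ended web tasks have no single fixed ground-truth string.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, rubric_item: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Task: {question}\nAnswer:\n{answer}\n\n"
                f"Does the answer satisfy this criterion: {rubric_item}?\n"
                "Reply YES or NO."
            ),
        }],
    ).choices[0].message.content
    return reply.strip().upper().startswith("YES")

def rubric_score(question: str, answer: str, rubric: list[str]) -> float:
    """Fraction of rubric items the judge deems satisfied."""
    return sum(judge(question, answer, r) for r in rubric) / len(rubric)

# Hypothetical rubric for an attribution-sensitive search task.
rubric = [
    "Every factual claim cites a source URL",
    "The cited pages actually support the claims",
    "All sub-questions of the task are answered",
]
```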
Impressed by V-JEPA 2's improvement on TemporalBench (temporalbench.github.io). Indeed, we need better video encoders for temporal-heavy tasks!
Introducing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Download V-JEPA 2 and read our research paper…
📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)
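A rough sketch of what such a collection pipeline can look like; every function below is a hypothetical stand-in, not AutoSDT's actual stages:

```python
# Sketch: filter candidate scientific programs with an LLM, then synthesize
# a task instruction for each kept program to form (instruction, code) pairs.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def is_data_driven(code: str) -> bool:
    """Filter: keep only programs that load data and produce an analysis result."""
    return ask(
        f"{code}\n\nDoes this program load a dataset and produce a "
        "scientific analysis result? Reply YES or NO."
    ).strip().upper().startswith("YES")

def to_task(code: str) -> dict:
    """Turn a kept program into an (instruction, reference-code) training pair."""
    instruction = ask(
        f"{code}\n\nWrite the task instruction a scientist would give "
        "to produce this program."
    )
    return {"instruction": instruction, "code": code}

def build_dataset(candidate_programs: list[str]) -> list[dict]:
    return [to_task(c) for c in candidate_programs if is_data_driven(c)]
```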
Are we heading down the right path towards omni-modality? 🤔 This new paper explores the effects of extending modality in language models.
Thanks for sharing our work :)
📢 Introducing VisCoder – fine-tuned language models for Python-based visualization code generation and feedback-driven self-debugging. Existing LLMs struggle to generate reliable plotting code: outputs often raise exceptions, produce blank visuals, or fail to reflect the…
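A minimal sketch of the feedback-driven self-debugging loop: execute the generated plotting code, and if it raises, feed the traceback back for a revision. The prompts and the gpt-4o stand-in are illustrative; VisCoder's actual setup may differ:

```python
# Sketch: generate plotting code, run it, and repair it from its own traceback.
import traceback
from openai import OpenAI

client = OpenAI()

def generate(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",  # stand-in; a VisCoder checkpoint would slot in here
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def run(code: str) -> str | None:
    """Execute the plotting code; return a traceback on failure, None on success."""
    try:
        exec(code, {"__name__": "__main__"})
        return None
    except Exception:
        return traceback.format_exc()

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = generate(f"Write Python matplotlib code to: {task}. Code only.")
    for _ in range(max_rounds):
        error = run(code)
        if error is None:
            return code  # executed cleanly
        code = generate(
            f"This code failed:\n{code}\n\nTraceback:\n{error}\n"
            "Return the fixed code only."
        )
    return code
```

Execution feedback is what distinguishes this from single-shot generation: the model sees the concrete exception (or a blank figure) rather than guessing what went wrong.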
Had a blast working with @DarthZhu_! We analyze modality-specific models extended from the same #LLM backbone (e.g., Qwen2-VL, -Video, -Audio on #Qwen2) and try to combine them into omni ones. Though most results are negative, we have some interesting findings here :)
😴 Extending modality based on an LLM has been a common practice when we are talking about multimodal LLMs. ❓ Can it generalize to omni-modality? We study the effects of extending modality and ask three questions: arxiv.org/abs/2506.01872 #LLM #MLLM #OmniModality
Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cut 75% of the STEM budget in the US. Sorry for the long post, but the issue is really important, and I want to share what I know about it. The entire…
Thrilled to announce that I will be joining @UTAustin @UTCompSci as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is @AnthropicAI Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?…
⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet, @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…
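For concreteness, here is how an Attack Success Rate of this kind is typically computed over injection trials; `run_agent` and the `Trial` fields are hypothetical stand-ins for RedTeamCUA's sandboxed rollouts:

```python
# Sketch: measure Attack Success Rate (ASR) for a computer-use agent under
# indirect prompt injection. The attack succeeds when the agent executes the
# instruction planted in page content instead of only the user's benign task.
from dataclasses import dataclass

@dataclass
class Trial:
    benign_task: str       # what the user actually asked the agent to do
    injected_page: str     # page content with an adversarial instruction planted
    forbidden_action: str  # the action the injection tries to trigger

def run_agent(task: str, page: str) -> list[str]:
    """Stand-in: roll out the agent in a sandboxed environment and log its
    executed actions. Replace with a real interactive sandbox."""
    return []

def attack_success_rate(trials: list[Trial]) -> float:
    hits = 0
    for t in trials:
        actions = run_agent(t.benign_task, t.injected_page)
        # Count the trial as a hit if any executed action matches the injection.
        hits += any(t.forbidden_action in a for a in actions)
    return hits / len(trials)
```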
Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long…
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below 🧵👇
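The sanity check this points to boils down to re-running the pre-RL checkpoint yourself before trusting a paper's baseline; a toy illustration with hypothetical numbers:

```python
# Sketch: compare your own measurement of the pre-RL model against the
# baseline a paper reports. All numbers and data below are made up.
def exact_match(preds: list[str], gold: list[str]) -> float:
    """Accuracy is sensitive to the prompt template, few-shot format, and
    answer extraction, which is exactly where discrepancies tend to creep in."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

reported_pre_rl = 0.265            # hypothetical baseline copied from a paper
preds = ["A", "C", "B", "B"]       # your own re-run of the same checkpoint
gold  = ["A", "C", "B", "D"]
measured = exact_match(preds, gold)
if measured > reported_pre_rl + 0.05:   # tolerance for eval noise
    print(f"Pre-RL baseline likely underreported: "
          f"{measured:.1%} measured vs {reported_pre_rl:.1%} reported")
```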