Yu Su
@ysu_nlp
Prof.@OhioState, co-director @osunlp. author of Mind2Web, SeeAct, MMMU, HippoRAG, BioCLIP, UGround. manifesting my thinking of intelligence into language agents
Sharing the slides of my talk at Princeton yesterday, "A holistic and critical look at language agents": ysu1989.github.io/resources/lang… LLM-based language agents are exciting, but it's also undeniably a quite chaotic space: are agents the next big thing, or are they just thin wrappers…

Announcing the @NeurIPSConf 2025 workshop on Imageomics: Discovering Biological Knowledge from Images Using AI! The workshop focuses on the interdisciplinary field between machine learning and biological science. We look forward to seeing you in San Diego! #NeurIPS2025
Impressive results. Can’t wait to try.
Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…
We’re thrilled to share our latest work: FLEXITOKENS! In this work, we introduce language models with learnable tokenizers that make tokenization truly flexible during adaptation. See example below ↓ 1/n
📢Check out this paper led by my amazing student, @AbrahamOwos, on making tokenizers more "flexible" during adaptation to tasks, domains, and languages. There's been a lot of interest in removing BPE tokenizers from LLMs by directly (learning to) chunk byte sequences. All these…
Attending #ICML2025 🇨🇦 this week! I’ll be co-organizing the Computer Use Agent Workshop @workshopcua on July 19th! Happy to chat about anything related to language agents — especially world modeling, scaling RL for agents, and multi-turn RL. Excited to meet old friends and…
Huan and I are looking for a postdoc to join us on agent research (broadly defined: planning, reasoning, safety, memory, continual learning, etc.). If you have a strong record in this space, drop us an email with CV! Retweet appreciated.
🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:
Thrilled to announce that our work Online-Mind2Web has been accepted to @COLM_conf ! 🎉 It's my first PhD work and first paper at COLM. See you in Montreal! 🍁 Several teams are already testing their agents on Online-Mind2Web. If you're curious about how your agent performs, try…
🚀Exciting update about our work! "An Illusion of Progress? Assessing the Current State of Web Agents." ✨ What’s New? 🆕 Claude Computer Use 3.7 performance analysis. 🆕 WebJudge, powered by o4-mini, achieves a remarkable 3.8% success rate gap with human judgment, demonstrating…
🧐Curious how far Claude Research can go in freeing you from tedious daily tasks? 🚀Check out our new results on Mind2Web 2! 💡 Looking forward to seeing even better agentic search systems! 🙌 Join the effort and test your system on Mind2Web 2 today!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
Our study led by @ChengleiSi reveals an “ideation–execution gap” 😲 Ideas from LLMs may sound novel, but when experts spend 100+ hrs executing them, they flop: 💥 👉 human‑generated ideas outperform on novelty, excitement, effectiveness & overall quality!
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Agentic search systems for web-scale information face an evaluation crisis due to their growing complexity and long, dynamic tasks. Mind2Web 2 provides a benchmark of 130 realistic, long-horizon tasks and a novel Agent-as-a-Judge framework to rigorously evaluate these systems.…
🧐Agentic search is revolutionizing how we gather information, but how reliable is it? Can it really deliver accurate answers with proper source attribution? 🚀Super excited to share our new work, Mind2Web 2, a rigorous agentic search benchmark with 130 realistic and…
Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…