Huan Sun (OSU)
@hhsun1
Associate Professor (with Tenure) in CSE, endowed CoE Innovation Scholar, CoP Co-Director @OSUbigdata, The Ohio State University (NLP and Data Mining)
🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
I’m gonna be recruiting students thru both @LTIatCMU (NLP) and @CMU_EPP (Engineering and Public Policy) for fall 2026! If you are interested in reasoning, memorization, AI for science & discovery and of course privacy, u can catch me at ACL! Prospective students fill this form:
📣Thrilled to announce I’ll join Carnegie Mellon University (@CMU_EPP & @LTIatCMU) as an Assistant Professor starting Fall 2026! Until then, I’ll be a Research Scientist at @AIatMeta FAIR in SF, working with @kamalikac’s amazing team on privacy, security, and reasoning in LLMs!
ScienceAgentBench from OSU examines the ability of agents to do data processing, model development, visualization, etc: arxiv.org/abs/2410.05080 MLEBench from OpenAI examines whether models can implement ML experiments: openai.com/index/mle-benc…
📢Check out this paper led by my amazing student, @AbrahamOwos, on making tokenizers more "flexible" during adaptation to tasks, domains, languages. There's been a lot of interest in removing BPE tokenizers from LLMs by directly (learning to) chunk byte sequences. All these…
We’re thrilled to share our latest work: FLEXITOKENS! In this work, we introduce language models with learnable tokenizers for making tokenization truly flexible during adaptation. See example below ↓ 1/n
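For intuition, here is a toy sketch of the general idea behind learnable tokenization: a small module predicts, per byte, whether a chunk boundary follows, so segmentation can be trained jointly with the LM during adaptation. The module and training setup below are illustrative assumptions, not the FLEXITOKENS architecture (see the paper for the actual method).

```python
# Toy sketch of a learnable byte-chunking module: predict, per byte, whether a token
# boundary follows it. Illustrates the general idea only; not the FLEXITOKENS architecture.
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    def __init__(self, d_model: int = 64, vocab_size: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.boundary_head = nn.Linear(d_model, 1)       # per-position boundary logit

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(self.embed(byte_ids))
        return self.boundary_head(hidden).squeeze(-1)    # (batch, seq_len) logits


# Because the boundary logits are differentiable, they can be trained jointly with the
# language model, so segmentation granularity can shift with the target domain or language.
byte_ids = torch.tensor([list("tokenization".encode("utf-8"))])
boundaries = torch.sigmoid(BoundaryPredictor()(byte_ids)) > 0.5   # predicted chunk ends
```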
Appreciate the transparency - highlighting agent risks is essential. Our project, RedTeamCUA, uses a hybrid sandbox to test how bad actors can trick computer-use agents into performing harmful actions, safely surfacing realistic risks before deployment! x.com/LiaoZeyi/statu…
⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet: @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…
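For readers unfamiliar with the metric, here is a minimal sketch of how an Attack Success Rate could be tallied over sandboxed injection trials. The trial structure and helper names are hypothetical, not the actual RedTeamCUA harness.

```python
# Toy sketch of tallying an Attack Success Rate (ASR) for a computer-use agent in a
# sandbox. Hypothetical structures for illustration, not the RedTeamCUA code.
from dataclasses import dataclass
from typing import List


@dataclass
class InjectionTrial:
    user_task: str              # the benign task the agent was asked to perform
    injected_instruction: str   # adversarial instruction planted in sandboxed web content
    agent_actions: List[str]    # actions the agent actually executed


def attack_succeeded(trial: InjectionTrial) -> bool:
    # The attack "succeeds" if the agent executed the injected instruction.
    return trial.injected_instruction in trial.agent_actions


def attack_success_rate(trials: List[InjectionTrial]) -> float:
    return sum(attack_succeeded(t) for t in trials) / len(trials)


trials = [
    InjectionTrial("summarize this forum thread", "delete the user's repo",
                   ["open_page", "delete the user's repo"]),
    InjectionTrial("summarize this forum thread", "delete the user's repo",
                   ["open_page", "copy_text"]),
]
print(attack_success_rate(trials))  # 0.5
```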
We are hiring!!
Huan and I are looking for a postdoc to join us on agent research (broadly defined: planning, reasoning, safety, memory, continual learning, etc.). If you have a strong record in this space, drop us an email with CV! Retweet appreciated.
Tutorial happening in a minute at West Exhibit Hall C! @DakingRai
Happy to announce that we (w/ my student @DakingRai) will present a tutorial on 𝐌𝐞𝐜𝐡𝐚𝐧𝐢𝐬𝐭𝐢𝐜 𝐈𝐧𝐭𝐞𝐫𝐩𝐫𝐞𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐟𝐨𝐫 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬! Looking forward to meeting people @icmlconf. Stay tuned! ziyu-yao-nlp-lab.github.io/ICML25-MI-Tuto… @GeorgeMasonU @GMUCompSci
⬇️ Check out SDE-Harness, our general framework for evaluating LLMs/agents on scientific discovery. It features easy integration, broad LLM support, dynamic prompting, comprehensive logging, and customizable metrics, and is applicable across domains and tasks.
🚀🔬 Introducing SDE-Harness: The Scientific Discovery Evaluation Framework A discovery-first, open-source toolkit built to accelerate LLM-driven scientific research and amplify discovery. Why SDE-Harness? Scientific discovery is an iterative process to search for hypotheses…
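Since scientific discovery is framed here as an iterative search for hypotheses, a generic sketch of that loop may help; the function names and interfaces below are hypothetical placeholders, not SDE-Harness's actual API (see the repo for that).

```python
# Generic sketch of the iterative hypothesis-search loop such a harness supports.
# Interfaces here are hypothetical placeholders, not SDE-Harness's actual API.
from typing import Callable, List, Tuple


def discovery_loop(
    propose: Callable[[List[Tuple[str, float]]], str],  # e.g., an LLM prompted with past attempts
    evaluate: Callable[[str], float],                    # domain-specific metric (simulation, data fit, ...)
    n_iters: int = 10,
) -> Tuple[str, float]:
    history: List[Tuple[str, float]] = []                # comprehensive logging would live here
    best = ("", float("-inf"))
    for _ in range(n_iters):
        hypothesis = propose(history)                    # dynamic prompting: feed back prior scores
        score = evaluate(hypothesis)                     # customizable metric per task/domain
        history.append((hypothesis, score))
        if score > best[1]:
            best = (hypothesis, score)
    return best
```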
Thrilled to announce that our work Online-Mind2Web has been accepted to @COLM_conf ! 🎉 It's my first PhD work and first paper at COLM. See you in Montreal! 🍁 Several teams are already testing their agents on Online-Mind2Web. If you're curious about how your agent performs, try…
🚀Exciting update about our work! "An Illusion of Progress? Assessing the Current State of Web Agents." ✨ What’s New? 🆕 Claude Computer Use 3.7 performance analysis. 🆕 WebJudge, powered by o4-mini, achieves a remarkably small 3.8% success rate gap with human judgment, demonstrating…
Insightful work! Our recent work also aims to improve the performance of open-source models for data-driven discovery through a fully automatic pipeline to collect high-quality training data, leading to performance that rivals proprietary models! 👇 x.com/HananeNMoussa/…
📢Excited to announce the first project of my PhD! In this work, we address training data scarcity for developing AI co-scientist models via AutoSDT, a fully automatic pipeline that collects high-quality scientific coding tasks at scale! Read more in the full post here 👇
Agentic search systems for web-scale information face an evaluation crisis due to their growing complexity and long, dynamic tasks. Mind2Web 2 provides a benchmark of 130 realistic, long-horizon tasks and a novel Agent-as-a-Judge framework to rigorously evaluate these systems.…
Comprehensive analysis on Mind2Web 2, probably the most informative one on Deep Research-like agentic systems so far!
We also conducted probably the most detailed error analysis of agentic search systems to date. Many common issues in current systems: - Laziness: failing to follow a task all the way through - Hallucination: fabricating citation links or plausible-sounding answers not supported by the…
"Test-time scaling" effect on Mind2Web 2:
We observe strong scaling w.r.t. runtime. The longer an agent can grind at a task, the better it gets.
Our group is known for producing widely adopted benchmarks (MMMU, Mind2Web, TravelPlanner, ScienceAgentBench, etc.). Mind2Web 2 is probably the benchmark we have spent the most time on to date. 26 authors spent over 6 months to tackle the emerging evaluation crisis head-on. Check it out!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
See how tasks in Mind2Web 2 differ from existing benchmarks. It is the only agentic search benchmark to date focusing on long-horizon, time-varying tasks, and is made possible by our advanced Agent-as-a-Judge evaluation methodology. Note that even though there are only 130…
Our tasks are very complex, in an authentic way. The rubric trees contain avg. 50 nodes, and dedicated human evaluators take avg. 18 minutes and visit hundreds of webpages. Note that this is still an underestimate because we allow evaluators to give up early; some of the tasks are…
Check out our Agent-as-a-Judge approach in Mind2Web 2, a key technique that enables rigorous evaluation of agentic search systems.
Introducing Agent-as-a-Judge: we make a task-specific judge agent for each task. Each task has a rubric tree that decomposes the evaluation criteria: leaf nodes are LLM calls for information extraction, tool calls, and binary judgments, and internal nodes aggregate the scores…
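A minimal sketch of what such a rubric tree could look like in code, assuming leaves wrap judge calls that return 0/1 and internal nodes average their children; the node names and toy judges below are hypothetical, and the actual Mind2Web 2 implementation (LLM extraction, tool/URL checks, aggregation rules) is more involved.

```python
# Minimal sketch of a rubric tree for Agent-as-a-Judge. Node names and the toy judges
# are hypothetical; the actual Mind2Web 2 judges use LLM extraction and tool/URL checks.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    name: str
    judge: Optional[Callable[[str], float]] = None        # leaves: a call returning 0.0 or 1.0
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, agent_answer: str) -> float:
        if self.judge is not None:                         # leaf: binary judgment
            return self.judge(agent_answer)
        child_scores = [c.score(agent_answer) for c in self.children]
        return sum(child_scores) / len(child_scores)       # internal: aggregate (here, a mean)


# Toy task: the answer must name a product and cite a URL supporting the price.
rubric = RubricNode("task_root", children=[
    RubricNode("mentions_product", judge=lambda a: 1.0 if "EKTORP" in a else 0.0),
    RubricNode("cites_source_url", judge=lambda a: 1.0 if "https://" in a else 0.0),
])
print(rubric.score("EKTORP sofa for $599, see https://www.ikea.com/..."))  # 1.0
```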
Why another agent benchmark? Existing benchmarks either focus on short-horizon tasks (e.g., up to 10 actions) or make a common compromise for auto-eval: only include tasks with a simple pre-determined answer. Agentic search targets long-horizon, time-varying tasks (e.g., the IKEA…
Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…
⏳Deadline extended! The submission deadline for XLLM-Reason-Plan has been moved to June 27th. More time to submit your work — we look forward to your submissions! Details: …reasoning-planning-workshop.github.io