Huan Sun (OSU)
@hhsun1
Associate Professor (with Tenure) in CSE, endowed CoE Innovation Scholar, CoP Co-Director @OSUbigdata, The Ohio State University (NLP and Data Mining)
🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
I’m gonna be recruiting students thru both @LTIatCMU (NLP) and @CMU_EPP (Engineering and Public Policy) for fall 2026! If you are interested in reasoning, memorization, AI for science & discovery and of course privacy, u can catch me at ACL! Prospective students fill this form:
📣Thrilled to announce I’ll join Carnegie Mellon University (@CMU_EPP & @LTIatCMU) as an Assistant Professor starting Fall 2026! Until then, I’ll be a Research Scientist at @AIatMeta FAIR in SF, working with @kamalikac’s amazing team on privacy, security, and reasoning in LLMs!
ScienceAgentBench from OSU examines the ability of agents to do data processing, model development, visualization, etc: arxiv.org/abs/2410.05080 MLEBench from OpenAI examines whether models can implement ML experiments: openai.com/index/mle-benc…
📢Check out this paper led by my amazing student, @AbrahamOwos, on making tokenizers more "flexible" during adaptation to tasks, domains, languages. There's been a lot of interest in removing BPE tokenizers from LLMs by directly (learning to) chunk byte sequences. All these…
We’re thrilled to share our latest work: FLEXITOKENS! In this work, we introduce language models with learnable tokenizers for making tokenization truly flexible during adaptation. See example below ↓ 1/n
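For intuition, here is a toy sketch of the general idea behind learnable tokenization: a small module predicts, per byte, whether a chunk boundary follows, so segmentation can be trained jointly with the LM during adaptation. The module and training setup below are illustrative assumptions, not the FLEXITOKENS architecture (see the paper for the actual method).

```python
# Toy sketch of a learnable byte-chunking module: predict, per byte, whether a token
# boundary follows it. Illustrates the general idea only; not the FLEXITOKENS architecture.
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    def __init__(self, d_model: int = 64, vocab_size: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.boundary_head = nn.Linear(d_model, 1)       # per-position boundary logit

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(self.embed(byte_ids))
        return self.boundary_head(hidden).squeeze(-1)    # (batch, seq_len) logits


# Because the boundary logits are differentiable, they can be trained jointly with the
# language model, so segmentation granularity can shift with the target domain or language.
byte_ids = torch.tensor([list("tokenization".encode("utf-8"))])
boundaries = torch.sigmoid(BoundaryPredictor()(byte_ids)) > 0.5   # predicted chunk ends
```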
Appreciate the transparency - highlighting agent risks is essential. Our project, RedTeamCUA, uses a hybrid sandbox to test how bad actors can trick computer-use agents into performing harmful actions, safely surfacing realistic risks before deployment! x.com/LiaoZeyi/statu…
⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet: @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…
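For readers unfamiliar with the metric, here is a minimal sketch of how an Attack Success Rate could be tallied over sandboxed injection trials. The trial structure and helper names are hypothetical, not the actual RedTeamCUA harness.

```python
# Toy sketch of tallying an Attack Success Rate (ASR) for a computer-use agent in a
# sandbox. Hypothetical structures for illustration, not the RedTeamCUA code.
from dataclasses import dataclass
from typing import List


@dataclass
class InjectionTrial:
    user_task: str              # the benign task the agent was asked to perform
    injected_instruction: str   # adversarial instruction planted in sandboxed web content
    agent_actions: List[str]    # actions the agent actually executed


def attack_succeeded(trial: InjectionTrial) -> bool:
    # The attack "succeeds" if the agent executed the injected instruction.
    return trial.injected_instruction in trial.agent_actions


def attack_success_rate(trials: List[InjectionTrial]) -> float:
    return sum(attack_succeeded(t) for t in trials) / len(trials)


trials = [
    InjectionTrial("summarize this forum thread", "delete the user's repo",
                   ["open_page", "delete the user's repo"]),
    InjectionTrial("summarize this forum thread", "delete the user's repo",
                   ["open_page", "copy_text"]),
]
print(attack_success_rate(trials))  # 0.5
```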
We are hiring!!
Huan and I are looking for a postdoc to join us on agent research (broadly defined: planning, reasoning, safety, memory, continual learning, etc.). If you have a strong record in this space, drop us an email with CV! Retweet appreciated.
Tutorial happening in a minute at West Exhibit Hall C! @DakingRai
Happy to announce that we (w/ my student @DakingRai) will present a tutorial on 𝐌𝐞𝐜𝐡𝐚𝐧𝐢𝐬𝐭𝐢𝐜 𝐈𝐧𝐭𝐞𝐫𝐩𝐫𝐞𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐟𝐨𝐫 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬! Looking forward to meeting people @icmlconf. Stay tuned! ziyu-yao-nlp-lab.github.io/ICML25-MI-Tuto… @GeorgeMasonU @GMUCompSci
⬇️ Check out SDE-Harness, our general framework for evaluating LLMs/agents on scientific discovery. It features easy integration, broad LLM support, dynamic prompting, comprehensive logging, and customizable metrics, and is applicable across domains and tasks.
🚀🔬 Introducing SDE-Harness: The Scientific Discovery Evaluation Framework A discovery-first, open-source toolkit built to accelerate LLM-driven scientific research and amplify discovery. Why SDE-Harness? Scientific discovery is an iterative process to search for hypotheses…
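Since scientific discovery is framed here as an iterative search for hypotheses, a generic sketch of that loop may help; the function names and interfaces below are hypothetical placeholders, not SDE-Harness's actual API (see the repo for that).

```python
# Generic sketch of the iterative hypothesis-search loop such a harness supports.
# Interfaces here are hypothetical placeholders, not SDE-Harness's actual API.
from typing import Callable, List, Tuple


def discovery_loop(
    propose: Callable[[List[Tuple[str, float]]], str],  # e.g., an LLM prompted with past attempts
    evaluate: Callable[[str], float],                    # domain-specific metric (simulation, data fit, ...)
    n_iters: int = 10,
) -> Tuple[str, float]:
    history: List[Tuple[str, float]] = []                # comprehensive logging would live here
    best = ("", float("-inf"))
    for _ in range(n_iters):
        hypothesis = propose(history)                    # dynamic prompting: feed back prior scores
        score = evaluate(hypothesis)                     # customizable metric per task/domain
        history.append((hypothesis, score))
        if score > best[1]:
            best = (hypothesis, score)
    return best
```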
Thrilled to announce that our work Online-Mind2Web has been accepted to @COLM_conf ! 🎉 It's my first PhD work and first paper at COLM. See you in Montreal! 🍁 Several teams are already testing their agents on Online-Mind2Web. If you're curious about how your agent performs, try…
🚀Exciting update about our work! "An Illusion of Progress? Assessing the Current State of Web Agents." ✨ What’s New? 🆕 Claude Computer Use 3.7 performance analysis. 🆕 WebJudge, powered by o4-mini, achieves a remarkably small 3.8% success rate gap with human judgment, demonstrating…
Insightful work! Our recent work also aims to improve the performance of open-source models for data-driven discovery through a fully automatic pipeline to collect high-quality training data, leading to performance that rivals proprietary models! 👇 x.com/HananeNMoussa/…
📢Excited to announce the first project of my PhD! In this work, we address training data scarcity for developing AI co-scientist models via AutoSDT, a fully automatic pipeline that collects high-quality scientific coding tasks at scale! Read more in the full post here 👇
Agentic search systems for web-scale information face an evaluation crisis due to their growing complexity and long, dynamic tasks. Mind2Web 2 provides a benchmark of 130 realistic, long-horizon tasks and a novel Agent-as-a-Judge framework to rigorously evaluate these systems.…
Comprehensive analysis on Mind2Web 2, probably the most informative one on Deep Research-like agentic systems so far!
We also conducted probably the most detailed error analysis of agentic search systems to date. Many common issues in current systems: - Laziness: failing to follow a task all the way through - Hallucination: fabricating citation links or plausible-sounding answers not supported by the…
"Test-time scaling" effect on Mind2Web 2:
We observe strong scaling w.r.t. runtime. The longer an agent can grind at a task, the better it gets.
Our group is known for producing widely adopted benchmarks (MMMU, Mind2Web, TravelPlanner, ScienceAgentBench, etc.). Mind2Web 2 is probably the benchmark we have spent the most time on to date. 26 authors spent over 6 months to tackle the emerging evaluation crisis head-on. Check it out!
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
See how tasks in Mind2Web 2 differ from existing benchmarks. It is the only agentic search benchmark to date focusing on long-horizon, time-varying tasks, and is made possible by our advanced Agent-as-a-Judge evaluation methodology. Note that even though there are only 130…
Our tasks are very complex, in an authentic way. The rubric trees contain avg. 50 nodes, and dedicated human evaluators take avg. 18 minutes and visit hundreds of webpages. Note that this is still an underestimate because we allow evaluators to give up early; some of the tasks are…
Check out our Agent-as-a-Judge approach in Mind2Web 2, a key technique that enables rigorous evaluation of agentic search systems.
Introducing Agent-as-a-Judge: we make a task-specific judge agent for each task. Each task has a rubric tree that decomposes the evaluation criteria: leaf nodes are LLM calls for information extraction, tool calls, and binary judgments, and internal nodes aggregate the scores…
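A minimal sketch of what such a rubric tree could look like in code, assuming leaves wrap judge calls that return 0/1 and internal nodes average their children; the node names and toy judges below are hypothetical, and the actual Mind2Web 2 implementation (LLM extraction, tool/URL checks, aggregation rules) is more involved.

```python
# Minimal sketch of a rubric tree for Agent-as-a-Judge. Node names and the toy judges
# are hypothetical; the actual Mind2Web 2 judges use LLM extraction and tool/URL checks.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    name: str
    judge: Optional[Callable[[str], float]] = None        # leaves: a call returning 0.0 or 1.0
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, agent_answer: str) -> float:
        if self.judge is not None:                         # leaf: binary judgment
            return self.judge(agent_answer)
        child_scores = [c.score(agent_answer) for c in self.children]
        return sum(child_scores) / len(child_scores)       # internal: aggregate (here, a mean)


# Toy task: the answer must name a product and cite a URL supporting the price.
rubric = RubricNode("task_root", children=[
    RubricNode("mentions_product", judge=lambda a: 1.0 if "EKTORP" in a else 0.0),
    RubricNode("cites_source_url", judge=lambda a: 1.0 if "https://" in a else 0.0),
])
print(rubric.score("EKTORP sofa for $599, see https://www.ikea.com/..."))  # 1.0
```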
Why another agent benchmark? Existing benchmarks either focus on short-horizon tasks (e.g., up to 10 actions) or make a common compromise for auto-eval: only include tasks with a simple pre-determined answer. Agentic search targets long-horizon, time-varying tasks (e.g., the IKEA…
Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…
⏳Deadline extended! The submission deadline for XLLM-Reason-Plan has been moved to June 27th. More time to submit your work — we look forward to your submissions! Details: …reasoning-planning-workshop.github.io