Boyu Gou (@BoyuGouNLP)

Pinned

Boyu Gou@BoyuGouNLP · Jul 4

🧐Curious how far Claude Research can go in freeing you from tedious daily tasks? 🚀Check out our new results on Mind2Web 2! 💡 Looking forward to seeing even better agentic search systems! 🙌 Join the effort and test your system on Mind2Web 2 today!

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…

Boyu Gou Retweeted

Jianyang Gu@vimar_gu · Jul 23

Announcing the @NeurIPSConf 2025 workshop on Imageomics: Discovering Biological Knowledge from Images Using AI! The workshop focuses on the interdisciplinary field between machine learning and biological science. We look forward to seeing you in San Diego! #NeurIPS2025

Boyu Gou Retweeted

Qwen@Alibaba_Qwen · Jul 22

>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…

Boyu Gou Retweeted

Multi-Turn Interaction LLM Workshop @ NeurIPS 2025@mti_neurips · Jul 21

🚀 Call for Papers — @NeurIPSConf 2025 Workshop Multi-Turn Interactions in LLMs 📅 December 6/7 · 📍 San Diego Convention Center Join us to shape the future of interactive AI. Topics include but are not limited to: 🧠 Multi-Turn RL for Agentic Tasks (e.g., web & GUI agents,…

Boyu Gou Retweeted

Rohan Paul@rohanpaul_ai · Jul 16

🎯 An ex-OpenAI engineer shares his thoughts about OpenAI Has lots of insights on OpenAI’s day-to-day life, unlike anything I have read before. He joined OpenAI as a software engineer on the applied side, spending about 14 months building the Codex coding agent and related…

Boyu Gou Retweeted

Huan Sun (OSU)@hhsun1 · Jul 15

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…

Boyu Gou@BoyuGouNLP · Jul 15

Huan and I are looking for a postdoc to join us on agent research (broadly defined: planning, reasoning, safety, memory, continual learning, etc.). If you have a strong record in this space, drop us an email with CV! Retweet appreciated.

Huan Sun (OSU)@hhsun1 · Jul 15

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…

Boyu Gou Retweeted

Andrej Karpathy@karpathy · Jul 13

Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…

Boyu Gou@BoyuGouNLP · Jul 8

Thrilled to announce that our work Online-Mind2Web has been accepted to @COLM_conf! 🎉 It's my first PhD work and first paper at COLM. See you in Montreal! 🍁 Several teams are already testing their agents on Online-Mind2Web. If you're curious about how your agent performs, try…

Tianci Xue@xue_tianci · May 13

🚀Exciting update about our work! "An Illusion of Progress? Assessing the Current State of Web Agents." ✨ What’s New? 🆕 Claude Computer Use 3.7 performance analysis. 🆕 WebJudge, powered by o4-mini, achieves a remarkable 3.8% success rate gap with human judgment, demonstrating…

Boyu Gou Retweeted

Xiang Yue@xiangyue96 · Jul 2

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…

Boyu Gou Retweeted

TuringPost@TheTuringPost · Jul 1

The freshest AI/ML research papers of the week Our top 7: ▪️ OctoThinker ▪️ Performance Prediction for Large Systems via Text-to-Text Regression ▪️ Radial Attention ▪️ MADrive ▪️ Mind2Web 2 ▪️ Chain-of-Experts ▪️ Ark ▪️ Where to find Grokking ▪️ Skywork-SWE ▪️ BlenderFusion ▪️…

Boyu Gou Retweeted

Rohan Paul@rohanpaul_ai · Jun 28

Agentic search systems for web-scale information face an evaluation crisis due to their growing complexity and long, dynamic tasks. Mind2Web 2 provides a benchmark of 130 realistic, long-horizon tasks and a novel Agent-as-a-Judge framework to rigorously evaluate these systems.…

Boyu Gou@BoyuGouNLP · Jun 27

Rigorously evaluating agentic systems has been one of our pursuits at @osunlp, with prior efforts including Mind2Web and ScienceAgentBench. Today we introduce Mind2Web 2 to evaluate the emerging Deep Research-like agents: It features realistic and diverse long-horizon web…

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…

Boyu Gou@BoyuGouNLP · Jun 27

Our group is known for producing widely adopted benchmarks (MMMU, Mind2Web, TravelPlanner, ScienceAgentBench, etc.). Mind2Web 2 is probably the benchmark we have spent the most time on. 26 authors spent over 6 months tackling the emerging evaluation crisis head-on. Check it out!

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…

Boyu Gou@BoyuGouNLP · Jun 27

Deep research systems have become part of our daily lives for very complicated tasks over large amounts of resources, e.g., the whole Internet. Applications include intelligence-intensive tasks like literature review, travel planning, and purchase advice. With fast-evolving systems,…

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…

Boyu Gou Retweeted

Yu Su@ysu_nlp · Jun 27

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…

Boyu Gou@BoyuGouNLP · Jun 14

Are you at #CVPR2025? RoboSpatial Oral is today! 📅 June 14 (Sat) | 🕐 1:00 PM | 📍Oral Session 4B @ ExHall A2

Chan Hee (Luke) Song@luke_ch_song · Mar 31

🔥 VLMs aren’t built for spatial reasoning — yet. They hallucinate free space. Misjudge object fit. Can’t tell below from behind. We built RoboSpatial to tackle that — a dataset for teaching spatial understanding to 2D/3D VLMs for robotics. 📝 Perfect review scores @CVPR 2025

Boyu Gou Retweeted

Xing Han Lu@xhluca · Jun 13

"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
