Boshi Wang
@BoshiWang2
Fourth-year Ph.D. @OhioState. Prev intern @MSFTResearch
LLMs exhibit the Reversal Curse, a basic generalization failure where they struggle to learn reversible factual associations (e.g., "A is B" -> "B is A"). But why? Our new work uncovers that it's a symptom of the long-standing binding problem in AI, and shows that a model design…

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with @ysu_nlp @osunlp. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,…
🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -…
📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)
📈 Scaling may be hitting a wall in the digital world, but it's only beginning in the biological world! We trained a foundation model on 214M images of ~1M species (50% of named species on Earth 🐨🐠🌻🦠) and found emergent properties capturing hidden regularities in nature. 🧵
🔬 Introducing ChemMCP, the first MCP-compatible toolkit for empowering AI models with advanced chemistry capabilities! In recent years, we’ve seen rising interest in tool-using AI agents across domains. Particularly in scientific domains like chemistry, LLMs alone still fall…
Systematic reviews (SRs) drive evidence-based medicine, but months-long workflows can’t keep pace with today’s literature flood. Fully autonomous solutions promise speed, but the magic often fizzles - these models still skip pivotal trials, hallucinate findings, and bury the…
⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet, @AnthropicAI Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for…
🚀 Thrilled to unveil the most exciting project of my PhD: Explorer — Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents TL;DR: A scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis. 📄 Paper:…
I will miss #NAACL2025 unfortunately, but please check out our work on chemistry agents, "ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving" today (May 1) during 2:00-3:30pm (local time) at Hall 3, Poster Session 5! Some updates: We have renamed…
🚨We just released the data generation code for RoboSpatial! 💾 github.com/NVlabs/RoboSpa… 📢 And yes, RoboSpatial is a #CVPR2025 Oral 🏆🔥
🔥 VLMs aren’t built for spatial reasoning — yet. They hallucinate free space. Misjudge object fit. Can’t tell below from behind We built RoboSpatial to tackle that — a dataset for teaching spatial understanding to 2D/3D VLMs for robotics. 📝 Perfect review scores @CVPR 2025
“What's the role of NLP/LLM researchers in agent research?” “Natural language is merely a tool for communication.” … These doubts and criticisms have circulated widely over the past two years. In my PhD dissertation, I want to provide a perspective that addresses these doubts…
It's a great honor to give a keynote at the @Molecule_Maker symposium at UIUC! Many thanks to Prof. @hengjinlp and Prof. Jiawei Han for invitation. The symposium’s theme this year is “AI scientist? What would it take?”, which I hold close to heart and made a talk titled “Language…
🔧What if your web agent could abstract its experience into programmatic skills—and improve itself autonomously? 🌟 Introducing SkillWeaver: a framework to enable self-improvement through autonomous exploration and constructing an ever-growing library of programmatic skills. 🧠…
🚀Big WebDreamer update! We train 💭Dreamer-7B, a small but strong world model for real-world web planning. 💥Beats Qwen2-72B ⚖️Matches #GPT-4o Trained on 3M synthetic examples — and yes, all data + models are open-sourced.
❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋♂️Short answer: Using your LLM as a world model 💡More detailed answer: Using GPT-4o to predict the outcome of actions on a website can deliver strong performance with improved safety and…