Melissa Pan
@melissapan
CS PhD @UCBerkeley Sky Lab 🐻 Systems & AI & Sustainability 🌍 Prev: @google, @ibm, @CarnegieMellon🐕‍🦺, @UofT🇨🇦
🚨 Why Do Multi-Agent LLM Systems Fail? ⁉️ 🔥 Introducing MAST: The first multi-agent failure taxonomy - consists of 14 failure modes and 3 categories, generalizes for diverse multi-agent systems and tasks! Paper: arxiv.org/pdf/2503.13657 Code: github.com/multi-agent-sy… 🧵1/n

very inspirational
Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.
Awesome read on Lucene's implementation of ACORN-1🔥🔥 Filtered vector search is everywhere! Efficient, general-purpose (predicate-agnostic) indices that can support those use cases are super, super powerful!! Try it out & check out our original paper dl.acm.org/doi/10.1145/36…
Elasticsearch / Lucene adopts ACORN-1, which expands the exploration of nodes to ensure enough candidates that meet the filter By @benwtrent elastic.co/search-labs/bl…
We at @NovaSkyAI have been hacking on RL across the stack—algorithms, envs, perf optimization. But progress is slowed by RL frameworks with tightly-coupled components that lack interfaces. To fill this gap, we upgraded SkyRL into a highly-modular RL framework. Check it out!!
✨Release: We upgraded SkyRL into a highly-modular, performant RL framework for training LLMs. We prioritized modularity—easily prototype new algorithms, environments, and training logic with minimal overhead. 🧵👇 Blog: novasky-ai.notion.site/skyrl-v01 Code: github.com/NovaSky-AI/Sky…
Multi-agent outperforming single-agent by 90.2% is very interesting. One reason we haven't seen multi-agents winning is that existing benchmarks are rather "simple." This makes multi-agents seem more like a PoC than a necessity, which is not a true reflection of MAS's capability.
New on the Anthropic Engineering blog: how we built Claude’s research capabilities using multiple agents working in parallel. We share what worked, what didn't, and the engineering challenges along the way. anthropic.com/engineering/bu…
1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀 Code: github.com/uccl-project/u… Blog: uccl-project.github.io/posts/about-uc… Results: AllReduce on 6 HGX across 2 racks over RoCE RDMA
We release Search Arena 🌐 — the first large-scale (24k+) dataset of in-the-wild user interactions with search-augmented LLMs. We also share a comprehensive report on user preferences and model performance in the search-enabled setting. Paper, dataset, and code in 🧵
Excited to share SkyRL-SQL, a simple yet effective multi-turn RL pipeline for training LLMs to generate and refine SQL through real database feedback. Rather than one-shot generation, models explore unfamiliar schemas, issue trial queries, reflect on results, and iteratively…
1/N Introducing SkyRL-SQL, a simple, data-efficient RL pipeline for Text-to-SQL that trains LLMs to interactively probe, refine, and verify SQL queries with a real database. 🚀 Early Result: trained on just ~600 samples, SkyRL-SQL-7B outperforms GPT-4o, o4-mini, and SFT model…
Multi-agent LLM systems are exciting, but why do they so often fall short of their promise? A new paper from UC Berkeley, "Why Do Multi-Agent LLM Systems Fail?", offers one of the first systematic answers. The authors introduce MAST (Multi-Agent System Failure Taxonomy),…
1/N Introducing SkyRL-v0, our RL training pipeline enabling efficient RL training for long-horizon, real-environment tasks like SWE-Bench. We also open-source a series of our early trained models to showcase the potential of end-to-end online RL training on long-horizon (20-50…
Real world AI pipelines are often compound, multi-module, and multi-step programs—unlike most RL/GRPO implementations today which optimize a single agent. 🚨 Super excited to release dspy.GRPO, which lets you GRPO tune any arbitrary multi-module, multi-step DSPy program, with…
So many things in the run-up to DSPy 3. Here's a first, EXPERIMENTAL one: 🚨We're releasing dspy.GRPO, an online RL optimizer for DSPy programs Your DSPy code as-is can be dspy.GRPO'ed. Yes, even compound multi-module programs. Led by @NoahZiems @LakshyAAAgrawal @dilarafsoylu.
Berkeley CS Grad Entrepreneurs' Annual Mixer & After Party is happening TODAY at Databricks SF🌉 Excited to host PhDs, faculty, and alumni for an evening of research x startups, featuring panelists: John Schulman, Denis Yarats, Alex Dimakis and moderator Andy Konwinski (1/n)
Best tweet I read today🤣🤣🤣 MAST as a practical tool for hiring 😉
Ask me what I do at work and I will send this paper. This journal article is most of my job description. arXiv:2503.13657 (cs) [Submitted on 17 Mar 2025] Why Do Multi-Agent LLM Systems Fail? arxiv.org/abs/2503.13657
Super cool paper on the failure modes of 'multi-agent' LM systems. But I'm curious and willing to change my mind on why people expect such systems to become useful. What's the hypothesis behind setting up shallow copies of the same LLM and *just* asking them to talk? For…
Multi-agent systems are supposed to provide a framework for decomposing problems and a mechanism to incorporate competing objectives. Yet, despite the significant progress in AI and reasoning, useful multi-agent systems remain the future (and not the present). Why don't…
Very productive conversations with @melissapan @IntuitMachine @sh_reya @tonychenxyz @cyrusnewday. My tl;dr -> There are at least 4 different concepts here, and it's essential to study them separately. 1) Structured programming to fully express your intent or control on the…