Sayash Kapoor
@sayashk
CS PhD candidate @PrincetonCITP and senior fellow at @Mozilla. I tweet about agents, evaluation, reproducibility, AI for science. Book: http://aisnakeoil.com
The mainstream view of AI for science says AI will rapidly accelerate science, and that we're on track to cure cancer, double the human lifespan, colonize space, and achieve a century of progress in the next decade. In a new AI Snake Oil essay, @random_walker and I argue that…



The origin story of “AI as Normal Technology”, and lessons learned. Many people have asked how the “AI as Normal Technology” paper came to be. This paper has been an (ongoing) journey for me and @sayashk in developing not just the substance of our arguments but also learning how…
New blog post alert! 🚨 "What is the Hugging Face Community Building?", with @YJernite and @IreneSolaiman. The AI narrative focuses on big players, but the real story is happening in the open source AI ecosystem across 1.8M models, 450K datasets, and 560K apps, on @huggingface.
If we compared AI capabilities against humans with no access to tools, such as the internet, we would probably find that AI already outperformed humans at many or most cognitive tasks we perform at work. But of course this is not a helpful comparison and doesn’t tell us much…
In the running for my favorite blog post from Sayash and Arvind! When people ask me for areas I am most excited about for AI, my answer is often some version of "science-first AI-for-science". There is a lot to do to figure out what that means and how to do it well.
We ourselves are enthusiastic users of AI in our scientific workflows. On a day-to-day basis, it all feels very exciting. But the impact of AI on science as an institution, rather than individual scientists, is a different question that demands a different kind of analysis.…
Some aspects of AI discourse seem to come from a different planet, oblivious to basic realities on Earth. AI for science is one such area. In this new essay, @sayashk and I argue that visions of accelerating science through AI should be considered unserious if they don't confront…
Are LLMs correlated when they make mistakes? In our new ICML paper, we answer this question using responses of >350 LLMs. We find substantial correlation. On one dataset, LLMs agree on the wrong answer ~2x more than they would at random. 🧵(1/7)
After we invented the dynamo, it took us 40 years to electrify factories. In the process, we had to redesign the entire factory layout — electrifying existing factories didn't cut it. Software engineering will likewise need to undergo drastic changes to truly benefit from AI.…
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
How much time do AI coding tools save? @METR_Evals just released a rigorous study with a startling result: developers take 19% longer to complete tasks when using AI! The result is consistent with the idea that AI tools are most helpful for routine work in small projects,…
🎙️ New #InterpretingIndia episode! @NidhiSinghLive joins @sayashk to explore the hype, hope, and hazards of artificial intelligence. From flawed predictions to grounded policymaking, Kapoor makes the case for treating #AI as “normal technology.” Tune in: carnegieindia.org/podcasts/inter…
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks…
AI tools can detect truck driver fatigue and prevent deadly crashes. But the Teamsters are blocking their rollout. My latest @WSJ:
When coding with agents, my ideal GUI for context engineering would look like this. Key features: * Visually pick, resize, reorder what goes into the context. * The user is not forced to do all this manually; the agent is capable of auto-populating and the user can review/tweak.…
A post by Stripe engineer @thegautam on building a successful payments foundation model for fraud detection recently went viral. I want to talk about how unusual this particular use case is, which helps understand why such "instant wins" from deploying advanced AI are so rare. As…