Shreya Shankar
@sh_reya
doing a PhD @Berkeley_EECS, building http://docetl.org | teaching http://bit.ly/evals-ai | formerly ML eng & undergrad @Stanford CS
LLMs have made exciting progress on hard tasks! But they still struggle to analyze complex, unstructured documents (including today's Gemini 1.5 Pro 002). We (UC Berkeley) built 📜DocETL, an open-source, low-code system for LLM-powered data processing: data-people-group.github.io/blogs/2024/09/…

If you want ONE place to keep up with AI coding agents, you should pay attention to what @isaac_flath and @intellectronica are putting together: bit.ly/coding-ai. I've worked with both, and they have phenomenal taste. Isaac: 7+ years working on dev tools in both open…
To what extent can you automate or delegate evals? Is there a way to make it "not your problem"? 😅 Part 1 of 1: You should absolutely automate parts of it, as long as a human is in the loop. Many people are a bit too aggressive here, so you have to be careful. Some guidelines…
the biggest bottleneck in my workflow is communicating my intentions clearly, and no amount of increased model intelligence will solve that for me unless we are truly symbiotic, which i don't really want anyway. or i can cede my agency to its choices, which i don't want either
We’ve extended enrollment in our **last** live cohort on AI Evals until the end of this week! Here’s the syllabus (2 lessons per week):
Week 1: Fundamentals & Lifecycle of LLM Application Evaluation, Systematic Error Analysis
Week 2: Implementing Effective Evaluations,…
"failure mode taxonomy" is a good abstraction
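A minimal sketch of what that abstraction can look like in code. The failure mode names here are hypothetical placeholders; in practice, a taxonomy emerges from open-coding your own traces during error analysis, not from a predefined list.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical failure modes; replace with labels discovered in your own error analysis.
TAXONOMY = {
    "retrieval_miss": "Relevant context was not retrieved",
    "hallucination": "Output asserts facts absent from the context",
    "format_error": "Output violates the requested schema",
}

@dataclass
class Trace:
    trace_id: str
    failure_modes: list[str]  # labels assigned while reviewing this trace

def failure_mode_counts(traces: list[Trace]) -> Counter:
    """Aggregate labels so the most common failure modes surface first."""
    counts: Counter = Counter()
    for t in traces:
        counts.update(m for m in t.failure_modes if m in TAXONOMY)
    return counts

traces = [
    Trace("t1", ["hallucination"]),
    Trace("t2", ["retrieval_miss", "format_error"]),
    Trace("t3", ["hallucination"]),
]
print(failure_mode_counts(traces).most_common())
```

The payoff of the abstraction is exactly this aggregation step: once failures share a vocabulary, you can rank them by frequency and fix the biggest bucket first.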
Stop wasting time guessing why your AI fails. The most valuable skill I learned recently: error analysis maven.com/parlance-labs/… Hamel & Shreya teach you how to diagnose what's going wrong with your pipeline, and build evals you can trust at scale. Error analysis is just the…
Really like this set of standout ideas. We say a million things in the course reader and I love hearing what sticks / what's practical
Just published a blog post where I highlight 10 ideas that stood out to me from the first lesson and first three chapters of the course reader from the AI evals course taught by @HamelHusain and @sh_reya. vishalbakshi.github.io/blog/posts/202…
Today we restarted the @HamelHusain / @sh_reya evals course and to accompany this first week's class I'm publishing the first part of a series of annotated posts to accompany the course textbook. (Link in the 🧵) The aim was to give more examples from the @zenml_io LLMOps…
This is a great post by @sanjanayed and aligns well with what @HamelHusain and @sh_reya pitch in their evals course as well. You don't want to outsource your annotations. It makes a lot of sense to use tools that let you build your own annotation tools (using @v0, @lovable_dev…
Just wrapped up a tutorial - I use a custom annotations tool to build an end-to-end evaluation & experimentation pipeline🚀 Inspired by an article from @eugeneyan, I explore how to leverage annotations to construct evals, design thoughtful experiments, and systematically improve…
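One hedged sketch of the annotations-to-evals step described above. The record format and field names are assumptions, not the tutorial's actual schema; the idea is just that failed, annotated examples become regression cases for future runs.

```python
# Hypothetical annotation records exported from a custom annotation tool;
# adapt the field names to whatever your tool actually emits.
annotations = [
    {"input": "q1", "output": "a1", "label": "pass", "note": ""},
    {"input": "q2", "output": "a2", "label": "fail", "note": "missed constraint"},
]

def to_eval_set(records: list[dict]) -> list[dict]:
    """Turn failed annotations into regression cases to re-run on each new pipeline version."""
    return [
        {"input": r["input"], "known_issue": r["note"]}
        for r in records
        if r["label"] == "fail"
    ]

eval_set = to_eval_set(annotations)
```

Each entry keeps the original input plus the annotator's note, so a later run can check whether the previously observed issue has actually been fixed.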
ngl I'm most excited about this cage match between Eval vendors. They are going to solve the homework assignments, side-by-side. @hwchase17 (Langsmith) vs @mikeldking (Phoenix) vs @waydegilliam (Braintrust) maven.com/parlance-labs/…
Excited to kick off a much improved version of our AI evals course tomorrow (link in replies). 💫 We've added dedicated homework sessions, plus an updated course reader & lectures that incorporate 100s of questions from cohort 1. There’s more hands-on/live error analysis, plus…
They tell you 2025 is the year of AI agents, and yes, that’s true in many ways. But it’s also becoming the year of evaluation. We’ve got startling models and tooling, but now we’re asking what’s working, what’s not, and how do we measure it? I recently took @HamelHusain and…
Best AI teams obsess over measurement and iteration. If you streamline your AI evals, all other activities become easy. But we can't simply take CI/CD from traditional software or ML. Why?
"Vibe checks" are great—until you need to scale. In this clip, @HamelHusain and @sh_reya break down why relying on human intuition isn’t enough when it comes to evaluating product or model quality at scale. Instead, they explain how to codify those gut checks into scalable,…
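A minimal sketch of what "codifying gut checks" can mean in practice. The specific checks below are invented examples; the point is converting intuitions ("answers should cite a source", "no apologetic filler") into functions you can run over every output at scale.

```python
# Hypothetical assertion-style checks; swap in the gut checks your own reviewers apply.
def cites_source(output: str) -> bool:
    """Gut check: a good answer links to a source."""
    return "http://" in output or "https://" in output

def no_filler(output: str) -> bool:
    """Gut check: no apologetic AI boilerplate."""
    banned = ("as an ai", "i apologize")
    return not any(phrase in output.lower() for phrase in banned)

CHECKS = {"cites_source": cites_source, "no_filler": no_filler}

def run_checks(output: str) -> dict[str, bool]:
    """Run every codified check over one output; aggregate across a dataset to track quality."""
    return {name: fn(output) for name, fn in CHECKS.items()}

print(run_checks("See https://example.com for details."))
```

Simple boolean checks like these won't capture every intuition (subjective quality usually needs a human or LLM judge), but they make the objective part of the vibe check cheap and repeatable.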
A step-by-step guide to diffusion models: bit.ly/4kw0uKo v/@goyal__pramod
The eval space is the most intense battle for AI market share I have seen, second only to coding agents. This is why we will have Arize & Braintrust go head-to-head. They will each show how to complete our 5 homework assignments using their tools. Over 1k students learning about…
If you like the FAQ, it pales in comparison to the textbook she wrote (not kidding), humbly named “course notes” 😅 Yes, we will release a book at some point, but the best way to learn evals is interactive practice, examples, and different perspectives. We bring that all…
just gonna leave this here if anyone is wondering how it feels to write the curriculum for a course on a hot topic
Just published summaries + a brief analysis of 287 LLMOps case studies from the past few months over on the @zenml_io blog. Some observations about what's actually happening in production AI:
- Agents are real now, but not what we expected
Most successful production agents are…