Wayde Gilliam
@waydegilliam
Helping customers look at their data @braintrustdata
This is 🔥
I've known @brettberson for over a decade, so this was a really fun and candid conversation about Braintrust and some of my wacky opinions about how to build a product. I think he's a world class interviewer and feel honored to be part of this series. Take a listen :)
We've learned critical lessons from helping teams ship reliable LLM-powered products. Organizations using Braintrust run over 3,000 evaluations daily, providing us with unique insights into what actually works. Read more on the blog: braintrust.dev/blog/five-less…
Reorganized the evals FAQ into categories, since there are so many now! You can also download the FAQ in different formats (pdf, markdown) from the sidebar on the page directly. hamel.dev/blog/posts/eva…
If you want ONE place to keep up with AI coding agents, you should pay attention to what @isaac_flath and @intellectronica are putting together: bit.ly/coding-ai. I've worked with both, and they have phenomenal taste. Isaac: 7+ years working on dev tools in both open…
Qwen released their updated "thinking" model today. It thinks really hard! Took 166 seconds to think through the details of drawing me a pelican on a bicycle. The finished drawing wasn't great but the thoughts behind it were fun to see. simonwillison.net/2025/Jul/25/qw…
🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet! Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving: ✅ Improved performance in logical reasoning, math, science & coding…
evals are all you need
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains 'We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a relative…
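To make the "checklist-style rubric as a reward signal" idea concrete, here is a minimal sketch. It is not the paper's implementation: the `Criterion` class, the `rubric_reward` function, the weights, and the string-matching checks are all hypothetical stand-ins (a real setup would likely use a judge model for each criterion). It only shows one plausible aggregation, a weighted fraction of satisfied checklist items.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One checklist item: a description, a weight, and a pass/fail check (all hypothetical)."""
    description: str
    weight: float
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Illustrative rubric; the simple string checks stand in for a judge model.
rubric = [
    Criterion("states the unit", 2.0, lambda r: "km" in r),
    Criterion("gives a number", 1.0, lambda r: any(ch.isdigit() for ch in r)),
    Criterion("is concise", 1.0, lambda r: len(r.split()) <= 20),
]
```

Because each criterion is scored separately, the per-item results can be logged alongside the scalar reward, which is what makes this kind of signal interpretable.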
something i've been thinking about recently: there are no more engineers, designers, PMs, etc. there are product owners. product owners write code, solicit feedback, drive roadmap, collaborate, talk to customers, answer support tickets, etc.
GitHub released Spark yesterday, their extremely well crafted prompt-to-app platform for creating and iterating on React apps with user auth and persistent storage. I like it a lot! I reverse engineered it with Spark itself, and the details are fascinating: simonwillison.net/2025/Jul/24/gi…
Thanks to our new faststripe lib, it's now this easy to integrate @stripe into your python app:
At @answerdotai, we integrate @stripe into lots of projects. Every time, I found myself doing the same dance: create product, create price, create checkout session. Then hunting docs for parameters for each. So we built FastStripe, a self-documenting Stripe SDK that's easy to…
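For context, the "create product, create price, create checkout session" dance looks roughly like this with the official `stripe` Python library. This is a sketch of the manual steps, not of FastStripe, whose API is not shown in the post. The function name `create_checkout_url` and its parameters are hypothetical, and the `stripe` module is passed in as an argument (an assumption made so the sketch can be exercised without a live API key).

```python
def create_checkout_url(stripe, name, amount_cents, currency, success_url, cancel_url):
    """Run the three-step flow and return the hosted checkout URL.

    `stripe` is expected to be the official `stripe` module with
    `stripe.api_key` already set; everything else here is illustrative.
    """
    # Step 1: create the product being sold.
    product = stripe.Product.create(name=name)

    # Step 2: attach a one-time price to the product (amount in the smallest currency unit).
    price = stripe.Price.create(
        product=product.id,
        unit_amount=amount_cents,
        currency=currency,
    )

    # Step 3: create a Checkout Session for a single unit at that price.
    session = stripe.checkout.Session.create(
        mode="payment",
        line_items=[{"price": price.id, "quantity": 1}],
        success_url=success_url,
        cancel_url=cancel_url,
    )
    return session.url
```

Each step needs an ID from the previous one, which is the repetition the post describes.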
I suspect this is largely because asynchronous programming is a new skill. If you're not good at it, then you won't gain productivity.
Developers using AI coding copilots complete 98% more code changes and 21% more tasks... but their companies don't ship faster due to downstream bottlenecks and a 91% increase in code review time. Any journalists want to see this study of 10K developers before it publishes?
Cursor Pro Tip: Always end your prompts with: “Explain the full approach you’d take to implement this. Just tell, don’t code.” Cursor will map out its entire plan. Review it, tweak if needed, then let it execute. It makes a HUGE difference in how well Cursor executes your…
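If you want to apply the tip above consistently, a tiny helper can append the instruction for you. This is only a sketch: `plan_first` and `PLAN_FIRST_SUFFIX` are hypothetical names, and nothing here talks to Cursor itself. It just builds the prompt text you would paste in.

```python
# The plan-only instruction quoted in the tip (straight quotes used for simplicity).
PLAN_FIRST_SUFFIX = (
    "Explain the full approach you'd take to implement this. Just tell, don't code."
)


def plan_first(prompt: str) -> str:
    """Return the prompt with the plan-only instruction appended exactly once."""
    prompt = prompt.rstrip()
    if prompt.endswith(PLAN_FIRST_SUFFIX):
        # Already has the instruction; avoid duplicating it.
        return prompt
    return f"{prompt}\n\n{PLAN_FIRST_SUFFIX}"
```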
things i install when i ssh:
* tmux
* neovim
* claude
* uv
* gh
* htop
* dstat
This looks sick!
I'm launching Context Engineering For Coding to make AI assisted coding more efficient, and it works with Cursor, Claude Code, Copilot, all of them. Here's a 30% off discount link for early enrollers maven.com/kentro/context…
ngl I'm most excited about this cage match between Eval vendors. They are going to solve the homework assignments, side-by-side. @hwchase17 (Langsmith) vs @mikeldking (Phoenix) vs @waydegilliam (Braintrust) maven.com/parlance-labs/…
Excited to kick off a much improved version of our AI evals course tomorrow (link in replies). 💫 We've added dedicated homework sessions, an updated course reader & lectures that incorporate 100s of questions from cohort 1. There’s more hands-on/live error analysis, plus…
"Vibe checks" are great—until you need to scale. In this clip, @HamelHusain and @sh_reya break down why relying on human intuition isn’t enough when it comes to evaluating product or model quality at scale. Instead, they explain how to codify those gut checks into scalable,…
We've launched Claude for Financial Services. Claude now integrates with leading data platforms and industry providers for real-time access to comprehensive financial information, verified across internal and industry sources.
Just spotted this for the first time "Simply paste the URL of this blog post into Claude and tell it to set it up for you." Blog posts written for the computer. Amazing. steipete.me/posts/command-…
Here’s a teaser from my opening lecture next week. Join us if you too are interested in seeing me shape shift into a fiery brain as I intro @braintrustdata 🧠+📈 maven.com/parlance-labs/…
people seem to really love Loop. every day, people ask, "how do I build an agent like this?" the answer is simple :) use @braintrustdata
LFG!!! 🧠+🤝
The eval space is the most intense battle for AI market share I have seen, second only to coding agents. This is why we will have Arize & Braintrust go head-to-head. They will each show how to complete our 5 homework assignments using their tools. Over 1k students learning about…