Hamel Husain
@HamelHusain
Evals evals evals http://bit.ly/evals-ai About Me: https://hamel.dev
I built an API-first link shortener for agents. Because I was tired of paying @Bitly $40/month for 3 damn links and spend all my time in @claude_code. Meet tny.dev 🎥 Video demo below 👇
evals are all you need
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains 'We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a relative…
I was also pleasantly surprised the other day. The only frustrating thing is that I can't really put my finger on what makes it better. It seems faster, but there's something else too ...
Y’all @AmpCode is good. Try it. Thank me later. It’s expensive but not as expensive as your time.
in case you are wondering this is academia now
ICML’s Statement about subversive hidden LLM prompts We live in a weird timeline…
Most people don't realize how good OSS tools for AI coding are. Join @intellectronica and me as she summarizes the open-source landscape and highlights the top ones you should incorporate into your workflow. maven.com/p/2cb739/oss-i…
Just published a blog post where I highlight 10 ideas that stood out to me from the first lesson and first three chapters of the course reader from the AI evals course taught by @HamelHusain and @sh_reya. vishalbakshi.github.io/blog/posts/202…
My talk for @aiDotEngineer on what I think every person working with language models needs to know about GPUs is now available! - Latency lags bandwidth. - GPUs embrace bandwidth. - Don't be scared of N squared. - Use the Tensor Cores, Luke! youtube.com/watch?v=y-UGrY…
Today we restarted the @HamelHusain / @sh_reya evals course and to accompany this first week's class I'm publishing the first part of a series of annotated posts to accompany the course textbook. (Link in the 🧵) The aim was to give more examples from the @zenml_io LLMOps…
IMO the hardest part of web dev is getting social cards to work properly
I'm launching Context Engineering For Coding to make AI assisted coding more efficient, and it works with Cursor, Claude Code, Copilot, all of them. Here's a 30% off discount link for early enrollers maven.com/kentro/context…
This is a great post by @sanjanayed and aligns well with what @HamelHusain and @sh_reya pitch in their evals course as well. You don't want to outsource your annotations. It makes a lot of sense to use tools that let you build your own annotation tools (using @v0, @lovable_dev…
Just wrapped up a tutorial - I use a custom annotations tool to build an end-to-end evaluation & experimentation pipeline🚀 Inspired by an article from @eugeneyan, I explore how to leverage annotations to construct evals, design thoughtful experiments, and systematically improve…
ngl I'm most excited about this cage match between Eval vendors. They are going to solve the homework assignments, side-by-side. @hwchase17 (LangSmith) vs @mikeldking (Phoenix) vs @waydegilliam (Braintrust) maven.com/parlance-labs/…
Excited to kick off a much improved version of our AI evals course tomorrow (link in replies). 💫 We've added dedicated homework sessions, an updated course reader & lectures that incorporate 100s of questions from cohort 1. There’s more hands-on/live error analysis, plus…
Fairly convincing phishing attempt ... watch out folks don't fall for this (email was from [email protected])

🎯 Benchmarks vs. Evals: How I learned to tell the difference by remembering my dating days Picture this: Your friends set you up on a blind date. 💑 They tell you everything: • Tall ✓ • Deep blue eyes ✓ • Shiny brown hair ✓ • Economics PhD ✓ • Volleyball enthusiast ✓…
They tell you 2025 is the year of AI agents, and yes, that’s true in many ways. But it’s also becoming the year of evaluation. We’ve got startling models and tooling, but now we’re asking what’s working, what’s not, and how do we measure it? I recently took @HamelHusain and…