Aparna Dhinakaran
@aparnadhinak
AI Founder: building @arizeai & @arizephoenix 💙 I post about LLMs, LLMOps, Generative AI, ML and occasionally Amazing Race
Reinforcement Learning in English – Prompt Learning Beyond Just Optimization. @karpathy tweeted something this week that I think many of us have been feeling: the resurgence of RL is great, but it's missing the big picture. We believe the industry chasing traditional RL is…
Scaling up RL is all the rage right now, I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…
It's soooo important to actually *look at your data* before jumping to solutions and evals. One of the questions we ask ourselves as an evals platform is: how do we best enable teams to look at and review their data? Annotation tools are great, but sometimes building your own…
Just wrapped up a tutorial - I use a custom annotations tool to build an end-to-end evaluation & experimentation pipeline🚀 Inspired by an article from @eugeneyan, I explore how to leverage annotations to construct evals, design thoughtful experiments, and systematically improve…
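A minimal sketch of the annotations-to-evals idea from the tutorial above. The annotation schema (`input`, `output`, `verdict`, `note`) is an assumption for illustration, not the actual tool's format: reviewed examples that pass become golden cases, and ones that fail become targeted cases the next revision must fix.

```python
# Hypothetical annotation schema: each record has input, output, verdict, note.
def annotations_to_evals(annotations):
    """Promote human-reviewed annotations into a regression-style eval set.

    Passing annotations become golden cases (expected output preserved);
    failing ones become must-fix cases for future prompt/model revisions.
    """
    evals = []
    for a in annotations:
        evals.append({
            "input": a["input"],
            "expected": a["output"] if a["verdict"] == "pass" else None,
            "must_fix": a["verdict"] == "fail",
            "reviewer_note": a.get("note", ""),
        })
    return evals

# Toy usage with two reviewed annotations.
sample = [
    {"input": "q1", "output": "a1", "verdict": "pass"},
    {"input": "q2", "output": "bad answer", "verdict": "fail", "note": "wrong tone"},
]
eval_set = annotations_to_evals(sample)
```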
evals are all you need
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains 'We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a relative…
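In the spirit of the RaR framing above, a checklist rubric can be reduced to a scalar reward: each rubric item is a weighted yes/no check, and the reward is the weighted fraction of checks the response passes. This is an illustrative sketch, not the paper's implementation; the rubric items and the keyword "judge" below are placeholders for an LLM judge.

```python
def rubric_reward(response, rubric, judge):
    """Weighted fraction of rubric checks the response passes, in [0, 1].

    rubric: list of (criterion, weight) pairs.
    judge(criterion, response) -> bool decides whether a check passes.
    The returned scalar can feed on-policy RL training as a reward signal.
    """
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge(crit, response))
    return earned / total if total else 0.0

# Toy usage: a keyword lookup stands in for an LLM judge.
rubric = [("mentions a dosage", 2.0), ("cites a source", 1.0)]
keywords = {"mentions a dosage": "mg", "cites a source": "http"}
judge = lambda crit, resp: keywords[crit] in resp
score = rubric_reward("Take 200 mg daily.", rubric, judge)  # 2.0 / 3.0
```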
This was such a great episode! I love how YouTube surfaces a gem like this now and then. Loved listening to both @_amankhan and @lennysan
I open sourced Sniffly, a tool that analyzes Claude Code logs to help me understand my usage patterns and errors. Key learnings: 1. The biggest type of error Claude Code made is Content Not Found (20-30%). It tries to find files or functions that don't exist. So I…
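The kind of log analysis described here can be sketched as a simple categorization pass: bucket error entries by category and report each category's share. The log record shape (`is_error`, `category`) is an assumption for illustration, not Sniffly's actual format.

```python
from collections import Counter

def error_breakdown(log_entries):
    """Share of all errors per category, as fractions summing to 1."""
    errors = [e["category"] for e in log_entries if e.get("is_error")]
    counts = Counter(errors)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

# Toy log: "Content Not Found" dominates, as in the tweet's 20-30% finding.
logs = [
    {"is_error": True, "category": "Content Not Found"},
    {"is_error": True, "category": "Content Not Found"},
    {"is_error": True, "category": "Tool Misuse"},
    {"is_error": False, "category": None},
]
shares = error_breakdown(logs)
```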
Most important tech blog this year: an OpenAI engineer and ex-founder of $3.5B Segment wrote a tell-all post about how OpenAI works internally. From obsession with X and devout use of Slack to engineering culture and tech stack. A peek under the hood of a generational company.
The secret to prompt optimization is evals. Saw this tweet by Jason Liu and it got me thinking about the future of prompt optimization. Most of us are in Cursor/Claude Code, and it makes a ton of sense to keep prompts close to code and iterate on them with AI code editors. The hard…
holy shit lmfao claude code has been writing a prompt, looking at 200 failures and updating the prompt, it went from v1 recall@1 60 -> 80 v2 recall@1 6 -> 52
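The loop described above can be sketched as a tiny eval harness: score each prompt version with recall@1 over a fixed test set, then collect the failures as raw material for the next revision. The `model` callable and dataset below are toy placeholders, not a real API.

```python
def recall_at_1(model, prompt, dataset):
    """Fraction of (query, label) pairs where the model's top answer matches."""
    hits = sum(1 for query, label in dataset if model(prompt, query) == label)
    return hits / len(dataset)

def failures(model, prompt, dataset):
    """The examples a prompt version gets wrong -- input for the next revision."""
    return [(q, y) for q, y in dataset if model(prompt, q) != y]

# Toy stand-in model whose behavior depends on the prompt text.
model = lambda prompt, query: query.upper() if "uppercase" in prompt else query
dataset = [("a", "A"), ("b", "B"), ("c", "c")]
v1 = recall_at_1(model, "echo the input", dataset)       # only ("c", "c") hits
v2 = recall_at_1(model, "uppercase the input", dataset)  # ("a", "A") and ("b", "B") hit
```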
One of our most enthusiastic students Pawel took the evals FAQ and upgraded it 😍 check it out
I got permission to publish this massive AI Evals FAQ (PDF). It's like a bible for AI engineers and AI PMs. @HamelHusain and @sh_reya answer the most common questions they got while teaching 700+ students. And share 30+ free videos, posts, and resources: 🧵
Trying to come up with the manifesto for an OSS evals library. Initial thoughts: • Speed - Speed should be a distinct advantage of using these evals over others. This may come at the cost of some accuracy at times and should be weighed, but in general speed of iteration should be…
Knowledge makes the world so much more beautiful.
+1 for "context engineering" over "prompt engineering". People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window…
I really like the term “context engineering” over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.