Sam Bowyer
@sambowyer__
Bristol ML PhD Student, Compass CDT
Our paper on the best way to add error bars to LLM evals is on arXiv! TL;DR: Avoid the Central Limit Theorem -- there are better, simple Bayesian (and frequentist!) methods you should be using instead. Super lightweight library: github.com/sambowyer/baye… 🧵👇

Thoughts after reading @sambowyer__ 's amazing position paper: Are there more sensible approaches to drawing error bars when reporting pass@k than just computing the standard deviation? arxiv.org/abs/2503.01747
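For concreteness, here is one sensible alternative: the standard unbiased pass@k estimator per question, with a percentile bootstrap over questions for the interval. This is an illustrative sketch (names and the bootstrap choice are mine, not necessarily the paper's recommended method):

```python
import numpy as np
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator for one question:
    n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_bootstrap_ci(n_samples, n_correct, k, n_boot=10_000,
                           alpha=0.05, seed=0):
    """Mean pass@k over questions, with a percentile bootstrap CI.
    Resamples *questions*, since they are the unit of variation."""
    rng = np.random.default_rng(seed)
    scores = np.array([pass_at_k(n, c, k)
                       for n, c in zip(n_samples, n_correct)])
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```

Unlike a naive standard deviation over samples, the interval here respects that correlated samples for the same question carry less information than samples for new questions.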
Is it possible to _derive_ an attention scheme with effective zero-shot generalisation? The answer turns out to be yes! To achieve this, we began by thinking about desirable properties for attention over long contexts, and we distilled 2 key conditions:
(Spotlight) LLM evals are increasingly based on tiny datasets (e.g. AIME), so considering uncertainty is becoming critical. We show approaches based on the CLT don't work, and give Bayesian+frequentist alternatives. (@sambowyer__ @desirivanova) arxiv.org/abs/2503.01747
Our position paper on LLM eval error bars has just been accepted to ICML 2025 as a spotlight poster!
Link: arxiv.org/abs/2503.08264 Code: github.com/alan-ppl/alan This work was a team effort; I'm very grateful to my collaborators @sambowyer__ and @laurence_ai. Thanks also to @g_leech_ who was involved in the MP-RWS paper.
Really happy to have this paper out on arXiv! Scalable GPU-based Bayesian inference for hierarchical models without requiring gradients wrt model parameters (unlike e.g. VI). arxiv.org/abs/2503.08264
Our paper Massively Parallel Expectation Maximization For Approximate Posteriors is now on arXiv! In this work we introduce the QEM method for fast approximate posterior estimation in Hierarchical Bayesian models. 🧵👇
So no more excuses for not adding error bars (or adding invalid ones 😬)
This, together with the CLT's failure to account for the typically binary nature of eval data (correct/incorrect responses to each eval question), leads to poor error bars that collapse to zero width or extend beyond [0,1].
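The failure mode is easy to reproduce. Below, a numpy-only sketch compares a CLT interval with an equal-tailed credible interval under a Beta(1,1) prior (the prior and Monte Carlo approach are my assumptions for illustration, not necessarily the paper's recommended setup):

```python
import numpy as np

def clt_interval(correct, z=1.96):
    """Normal-approximation (CLT) interval on mean accuracy.
    With binary data it can collapse to zero width or exit [0, 1]."""
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    mean = correct.mean()
    se = correct.std(ddof=1) / np.sqrt(n)
    return mean - z * se, mean + z * se

def beta_interval(correct, alpha=0.05, n_draws=200_000, seed=0):
    """Equal-tailed credible interval under a Beta(1,1) prior,
    estimated from Monte Carlo draws of the Beta posterior."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n, c = len(correct), int(correct.sum())
    draws = rng.beta(1 + c, 1 + n - c, size=n_draws)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

On 20/20 correct answers, the CLT interval collapses to a single point at 1.0, while the Beta posterior interval correctly stays strictly inside [0, 1] with nonzero width.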
I’ve been complaining about lack of error bars in LLM papers for some time. Rather than just complaining, here’s a guide on how to do it! ⬇️ We’ve done a small Python lib that you can install… or copy-paste one file into your projects (dependencies are annoying, we get it 🙃)
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇
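The quantity being estimated can be illustrated directly: the RMS change in a network's outputs over a batch when a weight update is applied. This toy sketch (tiny MLP, sizes and step chosen arbitrarily) just measures that change by evaluation before and after; it is not the paper's efficient FLeRM scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    """Tiny two-layer MLP: relu(x @ W1) @ W2."""
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2

# Toy setup (all sizes/scales are illustrative).
x = rng.normal(size=(64, 16))                          # batch of inputs
params = [rng.normal(size=(16, 32)) / 4,
          rng.normal(size=(32, 1)) / 6]
update = [rng.normal(size=p.shape) for p in params]    # a weight update
lr = 1e-3

# Function-space size of the update: RMS change in outputs across
# the batch, measured by evaluating before and after the step.
f_before = mlp(params, x)
f_after = mlp([p - lr * u for p, u in zip(params, update)], x)
func_space_lr = float(np.sqrt(np.mean((f_after - f_before) ** 2)))
```

The point of the paper is to estimate this function-space magnitude efficiently rather than by brute-force re-evaluation as above.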