Sam Bowyer
@sambowyer__
Bristol ML PhD Student, Compass CDT
Our paper on the best way to add error bars to LLM evals is on arXiv! TL;DR: Avoid the Central Limit Theorem -- there are better, simple Bayesian (and frequentist!) methods you should be using instead. Super lightweight library: github.com/sambowyer/baye… 🧵👇

Thoughts after reading @sambowyer__ 's amazing position paper: Are there more sensible approaches to drawing error bars when reporting pass@k than just computing the standard deviation? arxiv.org/abs/2503.01747
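For concreteness, here is one sensible alternative: the standard unbiased pass@k estimator per question, with a percentile bootstrap over questions for the interval. This is an illustrative sketch (names and the bootstrap choice are mine, not necessarily the paper's recommended method):

```python
import numpy as np
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator for one question:
    n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_bootstrap_ci(n_samples, n_correct, k, n_boot=10_000,
                           alpha=0.05, seed=0):
    """Mean pass@k over questions, with a percentile bootstrap CI.
    Resamples *questions*, since they are the unit of variation."""
    rng = np.random.default_rng(seed)
    scores = np.array([pass_at_k(n, c, k)
                       for n, c in zip(n_samples, n_correct)])
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```

Unlike a naive standard deviation over samples, the interval here respects that correlated samples for the same question carry less information than samples for new questions.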
Is it possible to _derive_ an attention scheme with effective zero-shot generalisation? The answer turns out to be yes! To achieve this, we began by thinking about desirable properties for attention over long contexts, and we distilled 2 key conditions:
(Spotlight) LLM evals are increasingly based on tiny datasets (e.g. AIME), so considering uncertainty is becoming critical. We show approaches based on the CLT don't work, and give Bayesian+frequentist alternatives. (@sambowyer__ @desirivanova) arxiv.org/abs/2503.01747
Our position paper on LLM eval error bars has just been accepted to ICML 2025 as a spotlight poster!
Link: arxiv.org/abs/2503.08264 Code: github.com/alan-ppl/alan This work was a team effort; I'm very grateful to my collaborators @sambowyer__ and @laurence_ai. Thanks also to @g_leech_ who was involved in the MP-RWS paper.
Really happy to have this paper out on arXiv! Scalable GPU-based Bayesian inference for hierarchical models without requiring gradients wrt model parameters (unlike e.g. VI). arxiv.org/abs/2503.08264
Our paper Massively Parallel Expectation Maximization For Approximate Posteriors is now on arXiv! In this work we introduce the QEM method for fast approximate posterior estimation in Hierarchical Bayesian models. 🧵👇
So no more excuses for not adding error bars (or adding invalid ones 😬)
This, together with the CLT's failure to account for the typically binary nature of eval data (correct/incorrect responses to each eval question), leads to poor error bars that collapse to zero width or extend beyond [0,1].
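The failure mode is easy to reproduce. Below, a numpy-only sketch compares a CLT interval with an equal-tailed credible interval under a Beta(1,1) prior (the prior and Monte Carlo approach are my assumptions for illustration, not necessarily the paper's recommended setup):

```python
import numpy as np

def clt_interval(correct, z=1.96):
    """Normal-approximation (CLT) interval on mean accuracy.
    With binary data it can collapse to zero width or exit [0, 1]."""
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    mean = correct.mean()
    se = correct.std(ddof=1) / np.sqrt(n)
    return mean - z * se, mean + z * se

def beta_interval(correct, alpha=0.05, n_draws=200_000, seed=0):
    """Equal-tailed credible interval under a Beta(1,1) prior,
    estimated from Monte Carlo draws of the Beta posterior."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n, c = len(correct), int(correct.sum())
    draws = rng.beta(1 + c, 1 + n - c, size=n_draws)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

On 20/20 correct answers, the CLT interval collapses to a single point at 1.0, while the Beta posterior interval correctly stays strictly inside [0, 1] with nonzero width.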
I’ve been complaining about lack of error bars in LLM papers for some time. Rather than just complaining, here’s a guide on how to do it! ⬇️ We’ve done a small Python lib that you can install… or copy-paste one file into your projects (dependencies are annoying, we get it 🙃)
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇
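The quantity being estimated can be illustrated directly: the RMS change in a network's outputs over a batch when a weight update is applied. This toy sketch (tiny MLP, sizes and step chosen arbitrarily) just measures that change by evaluation before and after; it is not the paper's efficient FLeRM scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    """Tiny two-layer MLP: relu(x @ W1) @ W2."""
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2

# Toy setup (all sizes/scales are illustrative).
x = rng.normal(size=(64, 16))                          # batch of inputs
params = [rng.normal(size=(16, 32)) / 4,
          rng.normal(size=(32, 1)) / 6]
update = [rng.normal(size=p.shape) for p in params]    # a weight update
lr = 1e-3

# Function-space size of the update: RMS change in outputs across
# the batch, measured by evaluating before and after the step.
f_before = mlp(params, x)
f_after = mlp([p - lr * u for p, u in zip(params, update)], x)
func_space_lr = float(np.sqrt(np.mean((f_after - f_before) ** 2)))
```

The point of the paper is to estimate this function-space magnitude efficiently rather than by brute-force re-evaluation as above.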