Edward Milsom
@edward_milsom
Machine learning PhD student working on deep learning and deep kernel methods. Compass CDT, University of Bristol.
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇

The NeurIPS paper checklist corroborates the bureaucratic theory of statistics. argmin.net/p/standard-err…
What's some "must read" literature on generalisation in neural networks? I keep thinking about this paper and it really makes me want to understand better the link between optimisation and generalisation. arxiv.org/abs/2302.12091
Me: Asks literally any question
LLM: Excellent! You're really getting to the heart of computer architecture / electrical infrastructure / the history of Barcelona.
Don't flatter me LLM, I am aware of my own limitations, even if you are not.
Is it possible to _derive_ an attention scheme with effective zero-shot generalisation? The answer turns out to be yes! To achieve this, we began by thinking about desirable properties for attention over long contexts, and we distilled 2 key conditions:
This is really a beautiful idea: autodiff alleviates graduate students' pain from manually deriving gradients, but μP-ish work brings the pain back! This work provides a way to simply SHUT OFF your brain and still get hparam transfer.
To address the "parameterisation lottery" (ideas win because they work well with popular choices of e.g. learning rates), I think empirical hyperparameter transfer methods are crucial. Rules like mu-P require you to derive them first, which is painful... x.com/edward_milsom/…
Happy to announce that my lab has four papers accepted at ICML, including one spotlight:
It seems none of the big open-source models are using mu-P yet (correct me if I'm wrong!). According to this it should be quite easy: cerebras.ai/blog/the-pract… Are there any major drawbacks to using mu-P? (I'd be very surprised if Grok wasn't using it, because Greg Yang.)
Our position paper on LLM eval error bars has just been accepted to ICML 2025 as a spotlight poster!
Our paper on the best way to add error bars to LLM evals is on arXiv! TL;DR: Avoid the Central Limit Theorem -- there are better, simple Bayesian (and frequentist!) methods you should be using instead. Super lightweight library: github.com/sambowyer/baye… 🧵👇
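To give a flavour of the difference (a minimal sketch with made-up numbers; the Beta-Binomial credible interval below is a standard Bayesian choice used here as an illustration, not necessarily the paper's exact recommendation or the library's API):

```python
import numpy as np
from scipy import stats

# Hypothetical eval result: 87 correct answers out of 100 questions.
n, k = 100, 87

# CLT-style error bar on accuracy: p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n).
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print("CLT 95% interval:", (p_hat - 1.96 * se, p_hat + 1.96 * se))

# A simple Bayesian alternative: Beta(1 + k, 1 + n - k) posterior over the
# accuracy under a uniform prior, reported as a 95% credible interval.
posterior = stats.beta(1 + k, 1 + n - k)
print("Beta posterior 95% interval:", posterior.interval(0.95))
```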
I talked to a lot of people at ICLR about "a weight decay paper from Wang and Aitchison", which has now officially been accepted at #ICML2025. Laurence summarized the paper's contents in his post; here I will talk about its connection to a *broad* collection of existing works. 1/
1/ Super proud of our recent work on how to change the AdamW weight decay as you scale model + dataset size. Or how μP is broken and how to fix it. arxiv.org/abs/2405.13698…
Function-Space Learning Rates has been accepted to ICML 2025! Go read about our paper here: x.com/edward_milsom/…
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇
wow, didn't know CS336 covers scaling topics: scaling laws, critical batch size, muP and so on. (this lecture slide screenshot is from 2024)
Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in @OpenAI’s non-technical reports? @percyliang and @tatsu_hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.
Easy (but informative) exercise: Show by induction that an exponential moving average distributes over sums, i.e. EMA(\sum_i X_i)_t = \sum_i EMA(X_i)_t. What EMA initialisation strategies make the base case hold?
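A quick numerical check of the claim (a minimal sketch; I'm assuming the standard recursion m_t = beta * m_{t-1} + (1 - beta) * x_t with zero initialisation, which is one initialisation that makes the base case hold):

```python
import numpy as np

def ema(xs, beta, init=0.0):
    """Exponential moving average: m_t = beta * m_{t-1} + (1 - beta) * x_t."""
    m = init
    out = []
    for x in xs:
        m = beta * m + (1 - beta) * x
        out.append(m)
    return np.array(out)

rng = np.random.default_rng(0)
beta = 0.9
X = rng.normal(size=(3, 100))  # three signals X_i

lhs = ema(X.sum(axis=0), beta)          # EMA of the sum
rhs = sum(ema(x, beta) for x in X)      # sum of the EMAs
print(np.allclose(lhs, rhs))            # True with zero initialisation
```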
There's a lot to process here, but I was pleased to see that Anthropic's 'Circuit Tracing' paper cites three of our recent contributions to the interpretability literature! 1/
For more, read our papers:
On the Biology of a Large Language Model contains an interactive explanation of each case study: transformer-circuits.pub/2025/attributi…
Circuit Tracing explains our technical approach in more depth: transformer-circuits.pub/2025/attributi…
If you make me president, the login node will have GPUs.
I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)