Bobby
@bobby_he
Machine Learning postdoc @ETH. PhD from @UniofOxford and former research intern @DeepMind/@samsungresearch
Outlier Features (OFs) aka “neurons with big features” emerge in standard transformer training & prevent the benefits of quantisation 🥲 But why do OFs appear & which design choices minimise them? Our new work (+@lorenzo_noci @DanielePaliotta @ImanolSchlag T. Hofmann) takes a look 👀🧵
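For intuition, a toy sketch (not the paper's setup): a single large-magnitude channel forces per-tensor int8 absmax quantisation to spend its range on that one neuron, inflating the error everywhere else. The activation sizes and the kurtosis proxy below are illustrative assumptions, not the paper's exact metric.

```python
# Toy illustration: one "outlier" channel inflates per-tensor int8 quantisation error.
import torch

torch.manual_seed(0)
acts = torch.randn(1024, 512)        # hypothetical activations: tokens x neurons
acts[:, 7] *= 100.0                  # inject a single outlier feature

def int8_quantise(x):
    scale = x.abs().max() / 127.0    # per-tensor absmax scaling
    return torch.round(x / scale).clamp(-127, 127) * scale

mask = torch.arange(acts.shape[1]) != 7   # same tensor with the outlier channel dropped
err_with = (acts - int8_quantise(acts)).abs().mean()
err_without = (acts[:, mask] - int8_quantise(acts[:, mask])).abs().mean()
print(f"mean abs quantisation error with outlier:    {err_with:.4f}")
print(f"mean abs quantisation error without outlier: {err_without:.4f}")

# A simple outlier-feature proxy: kurtosis of per-neuron RMS activations
neuron_rms = acts.pow(2).mean(0).sqrt()
kurtosis = ((neuron_rms - neuron_rms.mean()) / neuron_rms.std()).pow(4).mean()
print(f"kurtosis of neuron RMS norms: {kurtosis:.1f}  (≈3 for well-behaved features, huge with outliers)")
```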

Is equivariance necessary for a good 3D molecule generative model? Check out our #icml2025 paper, which closes the performance gap between non-equivariant and equivariant diffusion models via rotational alignment, while also being more efficient (1/7): arxiv.org/abs/2506.10186
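Rotational alignment, as I understand it (a minimal Kabsch-style sketch under that assumption, not the paper's training code): rotate the target coordinates onto the prediction before computing the loss, so the non-equivariant network never has to match an arbitrary orientation.

```python
# Minimal Kabsch alignment sketch (illustrative; not the paper's exact recipe).
import torch

def kabsch_align(target, pred):
    """Rotate `target` (N x 3) to best match `pred` (N x 3) in the least-squares sense."""
    t = target - target.mean(0)
    p = pred - pred.mean(0)
    # SVD of the 3x3 cross-covariance gives the optimal rotation
    U, _, Vt = torch.linalg.svd(t.T @ p)
    d = torch.sign(torch.linalg.det(U @ Vt)).item()   # fix possible reflection
    R = U @ torch.diag(torch.tensor([1.0, 1.0, d])) @ Vt
    return t @ R + pred.mean(0)

coords_true = torch.randn(20, 3)     # hypothetical molecule coordinates
coords_pred = torch.randn(20, 3)
aligned = kabsch_align(coords_true, coords_pred)
print(((aligned - coords_pred) ** 2).mean())   # loss computed after alignment
```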
🚀 TOMORROW afternoon at ICLR: Learn about the directionality of optimization trajectories in neural nets and how it inspires a potential way to make LLM pretraining more efficient ♻️ (Poster #585, Hall 2B)
Ever wondered what optimization trajectories look like when training neural nets & LLMs🤔? Do they contain a lot of twists 💃 and turns, or does the direction largely stay the same🛣️? We explore this in our work for LLMs (up to 12B params) + ResNets on ImageNet. Key findings👇
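One way to make "twists vs. a straight road" concrete (an illustrative sketch, not necessarily the paper's exact metric): cosine similarity between successive checkpoint-to-checkpoint update directions, close to 1 for a straight trajectory and near 0 for lots of turning.

```python
# Toy sketch of trajectory directionality via cosines of successive updates.
import torch

def flatten(params):
    return torch.cat([p.reshape(-1) for p in params])

def update_cosines(checkpoints):
    """checkpoints: list of parameter lists saved along a training run."""
    flats = [flatten(ckpt) for ckpt in checkpoints]
    deltas = [b - a for a, b in zip(flats[:-1], flats[1:])]
    return [torch.nn.functional.cosine_similarity(d1, d2, dim=0).item()
            for d1, d2 in zip(deltas[:-1], deltas[1:])]

# Hypothetical usage with a tiny model and fake "training steps":
model = torch.nn.Linear(10, 10)
ckpts = []
for step in range(5):
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))   # stand-in for an optimiser step
    ckpts.append([p.detach().clone() for p in model.parameters()])
print(update_cosines(ckpts))   # ~1.0 => straight road, ~0 => lots of turning
```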
🚨 NEW PAPER DROP! Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL? We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
A really fun project to work on. Looking at these plots side-by-side still amazes me! How well can **convex optimization theory** match actual LLM runs? My favorite points of our paper on the agreement for LR schedules in theory and practice: 1/n
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 arxiv.org/abs/2501.18965
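For context (the paper builds on a sharper last-iterate result; this is just the textbook bound of the same flavour): for convex, G-Lipschitz f, subgradient descent with steps x_{t+1} = x_t − η_t g_t admits a suboptimality bound in which the schedule enters only through sums of η_t and η_t²:

```latex
% Textbook weighted-average subgradient bound (not the paper's exact theorem):
% f convex and G-Lipschitz, g_t \in \partial f(x_t), x_{t+1} = x_t - \eta_t g_t.
\[
  f(\bar{x}_T) - f_\star \;\le\;
  \frac{\lVert x_1 - x_\star\rVert^2 + G^2 \sum_{t=1}^{T} \eta_t^2}
       {2 \sum_{t=1}^{T} \eta_t},
  \qquad
  \bar{x}_T = \frac{\sum_{t=1}^{T} \eta_t\, x_t}{\sum_{t=1}^{T} \eta_t}.
\]
```

Plugging a concrete schedule (constant, cosine, warmup-stable-decay, ...) into the right-hand side gives a predicted loss curve that can be laid over actual training runs.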
BPE is a greedy method to find a tokeniser which maximises compression! Why don't we try to find properly optimal tokenisers instead? Well, it seems this is a very difficult—in fact, NP-complete—problem!🤯 New paper + P. Whittington, @GregorBachmann1 :) arxiv.org/abs/2412.15210
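For reference, the greedy loop BPE runs (a bare-bones sketch of the standard algorithm, nothing specific to the paper): repeatedly merge the most frequent adjacent pair until the merge budget is spent.

```python
# Bare-bones greedy BPE training loop (illustrative sketch of the standard algorithm).
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)                        # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # greedy: most frequent pair wins
        merges.append(best)
        merged, i = [], 0
        while i < len(seq):                 # replace every occurrence of `best`
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, seq = train_bpe("abracadabra abracadabra", num_merges=5)
print(merges)     # greedy merge rules, in the order they were learned
print(len(seq))   # shorter sequence = better compression on the training text
```

The NP-completeness result is about doing better than this greedy choice, i.e. finding a tokeniser that globally maximises compression rather than taking the locally best merge each step.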
Come by poster #2402 East hall at NeurIPS from 11am-2pm Friday to chat about why outlier features emerge during training and how we can prevent them!
Updated camera ready arxiv.org/abs/2405.19279. New results include:
- non-diagonal preconditioners (SOAP/Shampoo) minimise OFs compared to diagonal ones (Adam/AdaFactor)
- scaling to 7B params
- our methods to reduce OFs translate to easier int8 post-training quantisation (PTQ).
Check it out!
Tuesday 1:30pm-3pm, Hall C 4-9 #515. Drop by our poster if you are interested in SSMs for graphs👇! Code: github.com/skeletondyh/GR…
Heading to Vienna tomorrow for ICML! Broke up the train journey to catch a concert celebrating the 200th birth year of Bruckner, near the stunning Attersee 🏞 Looking forward to catching up with old friends and meeting new ones next week 😊