Bobby
@bobby_he
Machine Learning postdoc @ETH. PhD from @UniofOxford and former research intern @DeepMind/@samsungresearch
Outlier Features (OFs) aka “neurons with big features” emerge in standard transformer training & prevent the benefits of quantisation 🥲 But why do OFs appear & which design choices minimise them? Our new work (+@lorenzo_noci @DanielePaliotta @ImanolSchlag T. Hofmann) takes a look 👀🧵
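For intuition, a toy sketch (not the paper's setup): a single large-magnitude channel forces per-tensor int8 absmax quantisation to spend its range on that one neuron, inflating the error everywhere else. The activation sizes and the kurtosis proxy below are illustrative assumptions, not the paper's exact metric.

```python
# Toy illustration: one "outlier" channel inflates per-tensor int8 quantisation error.
import torch

torch.manual_seed(0)
acts = torch.randn(1024, 512)        # hypothetical activations: tokens x neurons
acts[:, 7] *= 100.0                  # inject a single outlier feature

def int8_quantise(x):
    scale = x.abs().max() / 127.0    # per-tensor absmax scaling
    return torch.round(x / scale).clamp(-127, 127) * scale

mask = torch.arange(acts.shape[1]) != 7   # same tensor with the outlier channel dropped
err_with = (acts - int8_quantise(acts)).abs().mean()
err_without = (acts[:, mask] - int8_quantise(acts[:, mask])).abs().mean()
print(f"mean abs quantisation error with outlier:    {err_with:.4f}")
print(f"mean abs quantisation error without outlier: {err_without:.4f}")

# A simple outlier-feature proxy: kurtosis of per-neuron RMS activations
neuron_rms = acts.pow(2).mean(0).sqrt()
kurtosis = ((neuron_rms - neuron_rms.mean()) / neuron_rms.std()).pow(4).mean()
print(f"kurtosis of neuron RMS norms: {kurtosis:.1f}  (≈3 for well-behaved features, huge with outliers)")
```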

Is equivariance necessary for a good 3D molecule generative model? Check out our #icml2025 paper, which closes the performance gap between non-equivariant and equivariant diffusion models via rotational alignment, while also being more efficient (1/7): arxiv.org/abs/2506.10186
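Rotational alignment, as I understand it (a minimal Kabsch-style sketch under that assumption, not the paper's training code): rotate the target coordinates onto the prediction before computing the loss, so the non-equivariant network never has to match an arbitrary orientation.

```python
# Minimal Kabsch alignment sketch (illustrative; not the paper's exact recipe).
import torch

def kabsch_align(target, pred):
    """Rotate `target` (N x 3) to best match `pred` (N x 3) in the least-squares sense."""
    t = target - target.mean(0)
    p = pred - pred.mean(0)
    # SVD of the 3x3 cross-covariance gives the optimal rotation
    U, _, Vt = torch.linalg.svd(t.T @ p)
    d = torch.sign(torch.linalg.det(U @ Vt)).item()   # fix possible reflection
    R = U @ torch.diag(torch.tensor([1.0, 1.0, d])) @ Vt
    return t @ R + pred.mean(0)

coords_true = torch.randn(20, 3)     # hypothetical molecule coordinates
coords_pred = torch.randn(20, 3)
aligned = kabsch_align(coords_true, coords_pred)
print(((aligned - coords_pred) ** 2).mean())   # loss computed after alignment
```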
🚀 TOMORROW afternoon at ICLR: Learn about the directionality of optimization trajectories in neural nets and how it inspires a potential way to make LLM pretraining more efficient ♻️ (Poster #585, Hall 2B)
Ever wondered what optimization trajectories look like when training neural nets & LLMs🤔? Do they contain a lot of twists 💃 and turns, or does the direction largely stay the same🛣️? We explore this in our work for LLMs (up to 12B params) + ResNets on ImageNet. Key findings👇
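One way to make "twists vs. a straight road" concrete (an illustrative sketch, not necessarily the paper's exact metric): cosine similarity between successive checkpoint-to-checkpoint update directions, close to 1 for a straight trajectory and near 0 for lots of turning.

```python
# Toy sketch of trajectory directionality via cosines of successive updates.
import torch

def flatten(params):
    return torch.cat([p.reshape(-1) for p in params])

def update_cosines(checkpoints):
    """checkpoints: list of parameter lists saved along a training run."""
    flats = [flatten(ckpt) for ckpt in checkpoints]
    deltas = [b - a for a, b in zip(flats[:-1], flats[1:])]
    return [torch.nn.functional.cosine_similarity(d1, d2, dim=0).item()
            for d1, d2 in zip(deltas[:-1], deltas[1:])]

# Hypothetical usage with a tiny model and fake "training steps":
model = torch.nn.Linear(10, 10)
ckpts = []
for step in range(5):
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))   # stand-in for an optimiser step
    ckpts.append([p.detach().clone() for p in model.parameters()])
print(update_cosines(ckpts))   # ~1.0 => straight road, ~0 => lots of turning
```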
🚨 NEW PAPER DROP! Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL? We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
A really fun project to work on. Looking at these plots side-by-side still amazes me! How well can **convex optimization theory** match actual LLM runs? My favorite points of our paper on the agreement for LR schedules in theory and practice: 1/n
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 arxiv.org/abs/2501.18965
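For context (the paper builds on a sharper last-iterate result; this is just the textbook bound of the same flavour): for convex, G-Lipschitz f, subgradient descent with steps x_{t+1} = x_t − η_t g_t admits a suboptimality bound in which the schedule enters only through sums of η_t and η_t²:

```latex
% Textbook weighted-average subgradient bound (not the paper's exact theorem):
% f convex and G-Lipschitz, g_t \in \partial f(x_t), x_{t+1} = x_t - \eta_t g_t.
\[
  f(\bar{x}_T) - f_\star \;\le\;
  \frac{\lVert x_1 - x_\star\rVert^2 + G^2 \sum_{t=1}^{T} \eta_t^2}
       {2 \sum_{t=1}^{T} \eta_t},
  \qquad
  \bar{x}_T = \frac{\sum_{t=1}^{T} \eta_t\, x_t}{\sum_{t=1}^{T} \eta_t}.
\]
```

Plugging a concrete schedule (constant, cosine, warmup-stable-decay, ...) into the right-hand side gives a predicted loss curve that can be laid over actual training runs.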
BPE is a greedy method to find a tokeniser which maximises compression! Why don't we try to find properly optimal tokenisers instead? Well, it seems this is a very difficult—in fact, NP-complete—problem!🤯 New paper + P. Whittington, @GregorBachmann1 :) arxiv.org/abs/2412.15210
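For reference, the greedy loop BPE runs (a bare-bones sketch of the standard algorithm, nothing specific to the paper): repeatedly merge the most frequent adjacent pair until the merge budget is spent.

```python
# Bare-bones greedy BPE training loop (illustrative sketch of the standard algorithm).
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)                        # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # greedy: most frequent pair wins
        merges.append(best)
        merged, i = [], 0
        while i < len(seq):                 # replace every occurrence of `best`
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, seq = train_bpe("abracadabra abracadabra", num_merges=5)
print(merges)     # greedy merge rules, in the order they were learned
print(len(seq))   # shorter sequence = better compression on the training text
```

The NP-completeness result is about doing better than this greedy choice, i.e. finding a tokeniser that globally maximises compression rather than taking the locally best merge each step.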
Come by poster #2402 East hall at NeurIPS from 11am-2pm Friday to chat about why outlier features emerge during training and how we can prevent them!
Updated camera ready arxiv.org/abs/2405.19279. New results include:
- non-diagonal preconditioners (SOAP/Shampoo) minimise OFs compared to diagonal ones (Adam/AdaFactor)
- scaling to 7B params
- our methods to reduce OFs translate to easier int8 post-training quantisation (PTQ).
Check it out!
Tuesday 1:30pm-3pm, Hall C 4-9 #515. Drop by our poster if you are interested in SSMs for graphs👇! Code: github.com/skeletondyh/GR…
Heading to Vienna tomorrow for ICML! Broke up the train journey to catch a concert celebrating the 200th birth year of Bruckner, near the stunning Attersee 🏞 Looking forward to catching up with old friends and meeting new ones next week 😊