Nick Jiang
@nickhjiang
interpreting neural networks @berkeley_ai // cs + philosophy @ucberkeley // prev @briskteaching @watershed
Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵

So much of research runs on hope and faith that people who won't take leaps of faith are unlikely to enjoy it
Updated paper! Our main new finding: by creating attention biases at test time—without extra tokens—we remove high-norm outliers and attention sinks in ViTs, while preserving zero-shot ImageNet performance. Maybe ViTs don’t need registers after all? x.com/nickhjiang/sta…
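A minimal sketch of one way "attention biases at test time" could work, if it means giving softmax an extra per-head logit that soaks up attention mass so no image token has to become a high-norm sink. This is my reading of the tweet, not the authors' confirmed implementation; `biased_attention` and its signature are illustrative:

```python
import torch

def biased_attention(q, k, v, bias):
    """Attention with an extra per-head logit that absorbs attention
    mass, so no real token is forced to act as a high-norm sink.
    q, k, v: (batch, heads, tokens, dim); bias: (heads,) extra logit."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, H, T, T)
    extra = bias.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    logits = torch.cat([scores, extra], dim=-1)              # (B, H, T, T+1)
    attn = logits.softmax(dim=-1)[..., :-1]                  # drop the bias slot
    return attn @ v                                          # bias contributes no values

# Toy usage: 12 heads, 197 tokens (ViT-B/16 with CLS), zero bias logits.
q = k = v = torch.randn(1, 12, 197, 64)
out = biased_attention(q, k, v, bias=torch.zeros(12))
```

Note the rows of `attn` no longer sum to 1 over real tokens; the leftover mass drains into the bias slot, which is the whole point of removing the sink without appending a token.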
How do language models track the mental states of each character in a story, an ability often referred to as Theory of Mind? Our recent work takes a step toward demystifying it by reverse-engineering how Llama-3-70B-Instruct solves a simple belief-tracking task, and we surprisingly found that it…
Very interesting work. These are the same outliers seen in the LLM.int8() and attention-sinks papers, which suggests outliers could be handled through test-time adaptation. LLMs are trickier because they have natural registers (the BOS token), but a similar approach might work.
This is really cool and useful vision work, and it will solve many of the problems I've been having.
One of the most interesting papers to read at the moment, with major implications for language transformers as well.
Artifacts in your attention maps? Forgot to train with registers? Use 𝙩𝙚𝙨𝙩-𝙩𝙞𝙢𝙚 𝙧𝙚𝙜𝙞𝙨𝙩𝙚𝙧𝙨! We find that a sparse set of activations determines artifact positions. We can shift them anywhere ("Shifted"), even outside the image into an untrained token. Clean maps, no retraining.
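For the "shift them anywhere" mechanic, here is a rough sketch of how a test-time register might be wired up with forward hooks. It assumes a timm-style ViT; the module path `blocks[i].mlp.act`, the precomputed `register_neurons` map, and the caller appending one extra token to the sequence are all assumptions for illustration, not the paper's exact code:

```python
import torch

def install_test_time_register(model, register_neurons):
    """Shift sparse artifact-driving activations onto one extra token.
    register_neurons: {layer_index: [neuron ids]} identified offline
    (the identification step is not shown). Assumes the caller has
    already appended one untrained token to the input sequence; the
    last position then acts as the test-time register."""
    handles = []
    for layer_idx, ids in register_neurons.items():
        act = model.blocks[layer_idx].mlp.act  # timm-style path (assumed)

        def shift(module, inputs, output, ids=ids):
            out = output.clone()                  # (batch, tokens, hidden)
            peak = out[:, :-1, ids].amax(dim=1)   # strongest activation per neuron
            out[:, :-1, ids] = 0.0                # wipe it from the image patches
            out[:, -1, ids] = peak                # concentrate it on the register
            return out

        handles.append(act.register_forward_hook(shift))
    return handles  # call h.remove() on each handle to undo
```

Whether `amax` is the right aggregation, and whether the MLP activation is where the paper intervenes, are guesses; the takeaway is just that the intervention is a pure test-time edit with no retraining.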