TianyLin
@tianylin
DL practitioner #Ihavecats
Announcing 𝐟𝐥𝐚𝐬𝐡-𝐦𝐮𝐨𝐧: a 🐍 pkg with customized CUDA kernel that aims to boost Muon optimizer: github.com/nil0x9/flash-m… 1/n
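For context, the matmul-heavy step a kernel like this targets is Muon's Newton–Schulz orthogonalization of the update matrix. A minimal PyTorch sketch, following the coefficients of the public reference Muon implementation (the flash-muon package's own API may well differ):

```python
# Minimal sketch of the orthogonalization step Muon relies on (Newton-Schulz
# iteration) -- the part a fused CUDA kernel would speed up. Coefficients follow
# the public reference Muon implementation; not the flash-muon API itself.
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X /= (X.norm() + 1e-7)            # normalize so the spectral norm is at most 1
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X             # odd polynomial pushes singular values toward 1
    if G.size(0) > G.size(1):
        X = X.T
    return X
```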
very insightful
There is a more general version of this question: why not scale up the parameters of the attention operation and make it more expressive? (you can do it as suggested below, or simply increase the dimension of QKV) The empirical answer is that it’s not nearly as effective as…
It’s not a bug if you use it right! Some works use a similar property to accelerate model inference, e.g. SliceGPT and LaRoSa (except they use orthogonal matrices to avoid the P inverse here).
Lol it's worse than this. You can multiply in arbitrary invertible matrices: OV = O I V = (O P^-1)(P V). Which means pretty much the whole of V is redundant in a sense. The only thing that matters is its rank; everything else can be packed into O. This actually holds anywhere you…
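A quick numpy check of this invariance, with arbitrary made-up shapes: multiplying P into the value projection and P^-1 into the output projection leaves the layer's output unchanged.

```python
# Tiny numerical check: (W_V, W_O) and (W_V P, P^-1 W_O) give the same map,
# so the pair is only identified up to an invertible matrix P.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 32, 16, 8
W_V = rng.standard_normal((d_model, d_head))   # value projection
W_O = rng.standard_normal((d_head, d_model))   # output projection
P = rng.standard_normal((d_head, d_head))      # arbitrary (almost surely invertible) matrix

A = rng.standard_normal((seq, seq))            # stand-in for softmax(QK^T) weights
X = rng.standard_normal((seq, d_model))        # token representations

out = A @ X @ W_V @ W_O
out_repar = A @ X @ (W_V @ P) @ (np.linalg.inv(P) @ W_O)

print(np.max(np.abs(out - out_repar)))         # ~1e-12: numerically identical outputs
```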
You can fix it by doing the softmax in fp32 arxiv.org/abs/2506.13585
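A minimal sketch of what "softmax in fp32" means here, assuming a plain (non-fused) attention implementation; this isn't the referenced paper's code, just the idea:

```python
# Keep the matmuls in low precision but upcast the logits so the softmax itself
# runs in fp32, then cast back for the value matmul.
import torch
import torch.nn.functional as F

def attention_fp32_softmax(q, k, v):                 # q, k, v: (batch, heads, seq, dim), bf16
    scale = q.size(-1) ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale
    probs = F.softmax(logits.float(), dim=-1)        # softmax in fp32
    return torch.matmul(probs.to(v.dtype), v)        # back to bf16 for the value matmul
```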
I ran the same watermark test against Adam a few weeks back. This is the result I got:
This effect seems to just be an artifact of SGD/Adam/AdamW/etc., and more modern optimizers, e.g. Muon/Shampoo/PSGD, don't have this 'issue'. The crux is that the raw 'gradients' we get from backpropagation tend to have low (stable) rank. And optimizers like SGD/AdamW preserve…
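To make "low (stable) rank" concrete: stable rank is ||G||_F^2 / ||G||_2^2, and orthogonalizing the update (as Muon-style optimizers do) pushes it to full rank. A small PyTorch illustration, not taken from the thread:

```python
# Stable rank = ||G||_F^2 / ||G||_2^2: much smaller than min(m, n) when a few
# singular values dominate. Orthogonalizing the update maps all singular values
# to 1, so the applied step has full stable rank even if the raw gradient doesn't.
import torch

def stable_rank(G: torch.Tensor) -> float:
    s = torch.linalg.svdvals(G.float())        # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

# toy example: a near-rank-1 "gradient"
u, v = torch.randn(512, 1), torch.randn(1, 256)
G = u @ v + 0.01 * torch.randn(512, 256)
print(stable_rank(G))                          # close to 1

U, _, Vh = torch.linalg.svd(G, full_matrices=False)
print(stable_rank(U @ Vh))                     # = 256: the orthogonalized update
```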
Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
~8/8~ We release our NSA kernel for experimentation and research here: github.com/tilde-research… At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!
Join our ML Theory group next week as they welcome @tonysilveti on July 3rd for a presentation on "Training neural networks at any scale" Thanks to @itsmaddox_j @aniervs and @ThangChu77 for organizing this session 👏 Learn more: cohere.com/events/Cohere-…
One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard. 🚀Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized…
Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
If you cite Muon, I think you should definitely cite SSD (proceedings.mlr.press/v38/carlson15.…) by @CevherLIONS et al. (sorry, I can't find the handles of the other authors) -- which proposed spectral descent.
great thoughts
Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long…
PSA that theorists should really be more careful when working with continuous-time gradient flow (GF) vs gradient descent (GD) (continuous time ≠ runtime). I have similar remarks about GD vs SGD (GD steps ≠ sample complexity). The worst is when you have a result on GF and interpret it as if it were for…
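A toy illustration of the GF-vs-GD gap (my own example, not from the thread): on f(x) = x²/2, gradient flow converges over any time horizon, while GD with lr > 2 diverges, and only the GD step count reflects actual runtime.

```python
# On f(x) = x^2 / 2, the flow x'(t) = -x converges for every time budget, but the
# discrete iteration x_{k+1} = (1 - lr) x_k diverges as soon as lr > 2; compute
# cost is counted in GD steps, not continuous time.
import math

x0, t, lr, steps = 1.0, 10.0, 2.5, 10

x_flow = x0 * math.exp(-t)            # exact gradient-flow solution at time t
x_gd = x0
for _ in range(steps):
    x_gd -= lr * x_gd                 # one GD step on f(x) = x^2 / 2

print(x_flow)                         # ~4.5e-5: the flow has essentially converged
print(x_gd)                           # (-1.5)^10 ≈ 57.7: GD with this lr blows up
```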
When you discretize and reparametrize to theta = Bw, you can no longer use a constant lr. We found exactly this phenomenon in our work on scaling laws for neural nets that are not 1-homogeneous: arxiv.org/abs/2504.19983, slides dropbox.com/scl/fi/n53x2pe…, and also YouTube…
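A small numpy sketch of why a constant lr breaks (illustrative, not the paper's setup): one GD step on w under theta = Bw equals a preconditioned step on theta with preconditioner B Bᵀ, so the effective theta-space lr picks up a factor of ||B||².

```python
# GD on w with theta = B w is preconditioned GD on theta with preconditioner B B^T.
# Quadratic toy loss L(theta) = 0.5 * ||theta||^2, hypothetical sizes.
import numpy as np

rng = np.random.default_rng(0)
d, lr = 4, 0.1
B = 3.0 * rng.standard_normal((d, d))          # reparameterization theta = B w

theta = rng.standard_normal(d)
w = np.linalg.solve(B, theta)                  # same starting point in w-coordinates

grad_theta = theta                             # dL/dtheta for L = 0.5 ||theta||^2
grad_w = B.T @ grad_theta                      # chain rule: dL/dw = B^T dL/dtheta

theta_from_w_step = B @ (w - lr * grad_w)      # one GD step in w, mapped back to theta
theta_precond_step = theta - lr * (B @ B.T) @ grad_theta

print(np.max(np.abs(theta_from_w_step - theta_precond_step)))   # ~1e-15: identical
```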
arxiv.org/abs/2505.16932 everyone told me even SVD won't improve loss, so I only spent a bit of effort on improving the iters. After reading @leloykun's post last year, I had the intuition that a greedy "contraction" would be a good solution, but didn't know it was optimal.
It’s worth checking out, and @jxbz even shares the slides
I was really grateful to have the chance to speak at @Cohere_Labs and @ml_collective last week. My goal was to make the most helpful talk that I could have seen as a first-year grad student interested in neural network optimization. Sharing some info about the talk here... (1/6)
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇
Practical Efficiency of Muon for Pretraining "We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data…