TianyLin
@tianylin
DL practitioner #Ihavecats
Announcing 𝐟𝐥𝐚𝐬𝐡-𝐦𝐮𝐨𝐧: a 🐍 pkg with customized CUDA kernel that aims to boost Muon optimizer: github.com/nil0x9/flash-m… 1/n
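For context, the matmul-heavy step a kernel like this targets is Muon's Newton–Schulz orthogonalization of the update matrix. A minimal PyTorch sketch, following the coefficients of the public reference Muon implementation (the flash-muon package's own API may well differ):

```python
# Minimal sketch of the orthogonalization step Muon relies on (Newton-Schulz
# iteration) -- the part a fused CUDA kernel would speed up. Coefficients follow
# the public reference Muon implementation; not the flash-muon API itself.
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X /= (X.norm() + 1e-7)            # normalize so the spectral norm is at most 1
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X             # odd polynomial pushes singular values toward 1
    if G.size(0) > G.size(1):
        X = X.T
    return X
```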
very insightful
There is a more general version of this question: why not scale up the parameters of the attention operation and make it more expressive? (you can do it as suggested below, or simply increase the dimension of QKV) The empirical answer is that it’s not nearly as effective as…
It’s not a bug if you use it right! Some works use a similar property to accelerate model inference, e.g. SliceGPT and LaRoSa (except they use orthogonal matrices to avoid the P inverse here).
Lol it's worse than this. You can multiply in arbitrary invertible matrices: OV = O I V = (O P^-1)(P V). Which means pretty much the whole of V is redundant in a sense. The only thing that matters is its rank; everything else can be packed into O. This actually holds anywhere you…
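A quick numpy check of this invariance, with arbitrary made-up shapes: multiplying P into the value projection and P^-1 into the output projection leaves the layer's output unchanged.

```python
# Tiny numerical check: (W_V, W_O) and (W_V P, P^-1 W_O) give the same map,
# so the pair is only identified up to an invertible matrix P.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 32, 16, 8
W_V = rng.standard_normal((d_model, d_head))   # value projection
W_O = rng.standard_normal((d_head, d_model))   # output projection
P = rng.standard_normal((d_head, d_head))      # arbitrary (almost surely invertible) matrix

A = rng.standard_normal((seq, seq))            # stand-in for softmax(QK^T) weights
X = rng.standard_normal((seq, d_model))        # token representations

out = A @ X @ W_V @ W_O
out_repar = A @ X @ (W_V @ P) @ (np.linalg.inv(P) @ W_O)

print(np.max(np.abs(out - out_repar)))         # ~1e-12: numerically identical outputs
```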
You can fix it by doing the softmax in fp32 arxiv.org/abs/2506.13585
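A minimal sketch of what "softmax in fp32" means here, assuming a plain (non-fused) attention implementation; this isn't the referenced paper's code, just the idea:

```python
# Keep the matmuls in low precision but upcast the logits so the softmax itself
# runs in fp32, then cast back for the value matmul.
import torch
import torch.nn.functional as F

def attention_fp32_softmax(q, k, v):                 # q, k, v: (batch, heads, seq, dim), bf16
    scale = q.size(-1) ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale
    probs = F.softmax(logits.float(), dim=-1)        # softmax in fp32
    return torch.matmul(probs.to(v.dtype), v)        # back to bf16 for the value matmul
```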
I ran the same watermark test against Adam a few weeks back. This is the result I got:
This effect seems to just be an artifact of SGD/Adam/AdamW/etc., and more modern optimizers, e.g. Muon/Shampoo/PSGD, don't have this 'issue'. The crux is that the raw 'gradients' we get from backpropagation tend to have low (stable) rank. And optimizers like SGD/AdamW preserve…
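To make "low (stable) rank" concrete: stable rank is ||G||_F^2 / ||G||_2^2, and orthogonalizing the update (as Muon-style optimizers do) pushes it to full rank. A small PyTorch illustration, not taken from the thread:

```python
# Stable rank = ||G||_F^2 / ||G||_2^2: much smaller than min(m, n) when a few
# singular values dominate. Orthogonalizing the update maps all singular values
# to 1, so the applied step has full stable rank even if the raw gradient doesn't.
import torch

def stable_rank(G: torch.Tensor) -> float:
    s = torch.linalg.svdvals(G.float())        # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

# toy example: a near-rank-1 "gradient"
u, v = torch.randn(512, 1), torch.randn(1, 256)
G = u @ v + 0.01 * torch.randn(512, 256)
print(stable_rank(G))                          # close to 1

U, _, Vh = torch.linalg.svd(G, full_matrices=False)
print(stable_rank(U @ Vh))                     # = 256: the orthogonalized update
```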
Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
~8/8~ We release our NSA kernel for experimentation and research here: github.com/tilde-research… At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!
Join our ML Theory group next week as they welcome @tonysilveti on July 3rd for a presentation on "Training neural networks at any scale" Thanks to @itsmaddox_j @aniervs and @ThangChu77 for organizing this session 👏 Learn more: cohere.com/events/Cohere-…
One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard. 🚀Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized…
Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
If you cite Muon, I think you should definitely cite SSD (proceedings.mlr.press/v38/carlson15.…) by @CevherLIONS et al. (sorry, I can't find the handles of the other authors) -- which proposed spectral descent.
great thoughts
Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long…
PSA that theorists should really be more careful when working with continuous-time gradient flow (GF) vs gradient descent (GD) (continuous time ≠ runtime). I have similar remarks about GD vs SGD (GD steps ≠ sample complexity). The worst is when you have a result on GF and interpret it as if it were for…
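A toy illustration of the GF-vs-GD gap (my own example, not from the thread): on f(x) = x²/2, gradient flow converges over any time horizon, while GD with lr > 2 diverges, and only the GD step count reflects actual runtime.

```python
# On f(x) = x^2 / 2, the flow x'(t) = -x converges for every time budget, but the
# discrete iteration x_{k+1} = (1 - lr) x_k diverges as soon as lr > 2; compute
# cost is counted in GD steps, not continuous time.
import math

x0, t, lr, steps = 1.0, 10.0, 2.5, 10

x_flow = x0 * math.exp(-t)            # exact gradient-flow solution at time t
x_gd = x0
for _ in range(steps):
    x_gd -= lr * x_gd                 # one GD step on f(x) = x^2 / 2

print(x_flow)                         # ~4.5e-5: the flow has essentially converged
print(x_gd)                           # (-1.5)^10 ≈ 57.7: GD with this lr blows up
```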
When you discretize and reparametrize to theta = Bw, you can no longer use a constant lr. We found exactly this phenomenon in our work on scaling laws for neural nets that are not 1-homogeneous: arxiv.org/abs/2504.19983, slides dropbox.com/scl/fi/n53x2pe…, and also YouTube…
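A small numpy sketch of why a constant lr breaks (illustrative, not the paper's setup): one GD step on w under theta = Bw equals a preconditioned step on theta with preconditioner B Bᵀ, so the effective theta-space lr picks up a factor of ||B||².

```python
# GD on w with theta = B w is preconditioned GD on theta with preconditioner B B^T.
# Quadratic toy loss L(theta) = 0.5 * ||theta||^2, hypothetical sizes.
import numpy as np

rng = np.random.default_rng(0)
d, lr = 4, 0.1
B = 3.0 * rng.standard_normal((d, d))          # reparameterization theta = B w

theta = rng.standard_normal(d)
w = np.linalg.solve(B, theta)                  # same starting point in w-coordinates

grad_theta = theta                             # dL/dtheta for L = 0.5 ||theta||^2
grad_w = B.T @ grad_theta                      # chain rule: dL/dw = B^T dL/dtheta

theta_from_w_step = B @ (w - lr * grad_w)      # one GD step in w, mapped back to theta
theta_precond_step = theta - lr * (B @ B.T) @ grad_theta

print(np.max(np.abs(theta_from_w_step - theta_precond_step)))   # ~1e-15: identical
```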
arxiv.org/abs/2505.16932 everyone told me even SVD won't improve loss, so I only spent a bit of effort on improving the iters. After reading @leloykun's post last year, I had the intuition that a greedy "contraction" would be a good solution, but didn't know it was optimal.
It’s worth checking out, and @jxbz even shares the slides
I was really grateful to have the chance to speak at @Cohere_Labs and @ml_collective last week. My goal was to make the most helpful talk that I could have seen as a first-year grad student interested in neural network optimization. Sharing some info about the talk here... (1/6)
Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇
Practical Efficiency of Muon for Pretraining "We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data…