An Yan
@AnYan_ai
@SFResearch Prev: @UCSanDiego @Microsoft @Meta @Adobe. Working on Vision-Language.
i learned about this in a recent project and had to switch back from vLLM to HF (and eat a ~5x slowdown) just so my results would be consistent. please spread and help a fellow researcher out 🙏 e.g. github.com/vllm-project/v… github.com/vllm-project/v… github.com/vllm-project/v… ...
horrifying bug of the day is finding out that vLLM and HuggingFace produce significantly different logprobs discuss.vllm.ai/t/numerical-di…
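If you want to check the gap on your own model, here is a minimal sketch comparing teacher-forced prompt logprobs from HF transformers vs. vLLM (the model name, prompt, and the exact vLLM result fields are my assumptions; the prompt_logprobs layout can differ between vLLM versions):

```python
# Sketch: compare per-token prompt logprobs from HF transformers vs. vLLM.
# Assumes a small causal LM ("gpt2"); the vLLM result layout (prompt_logprobs as a
# list of {token_id: Logprob} dicts, first entry None) may vary by version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "gpt2"
PROMPT = "The quick brown fox jumps over the lazy dog."

# HF reference: teacher-forced logprob of each prompt token given its prefix.
tok = AutoTokenizer.from_pretrained(MODEL)
hf = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32).eval()
ids = tok(PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = torch.log_softmax(hf(ids).logits.float(), dim=-1)
hf_lp = logprobs[0, :-1].gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

# vLLM: request logprobs for the prompt tokens themselves (no real generation).
llm = LLM(model=MODEL)
out = llm.generate([PROMPT], SamplingParams(max_tokens=1, prompt_logprobs=0))[0]
vllm_lp = torch.tensor(
    [d[t].logprob for d, t in zip(out.prompt_logprobs[1:], ids[0, 1:].tolist())]
)

print("max abs logprob diff:", (hf_lp - vllm_lp).abs().max().item())
```

Small numerical differences from different kernels/dtypes are expected; the scary case is when they're large enough to change downstream results.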
“How will my model behave if I change the training data?” Recent(-ish) work w/ @logan_engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
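For anyone new to the term, "data attribution" is usually set up roughly like this (a common formulation, e.g. datamodels-style): encode each training subset as a 0/1 inclusion vector and fit a simple, often linear, surrogate from that vector to the model's output on a fixed test example. A toy sketch with synthetic training runs; train_and_eval below is a hypothetical stand-in for actually retraining a model:

```python
# Toy sketch of the data-attribution setup: predict a model's output on a fixed
# test example as a (linear) function of which training points were included.
# `train_and_eval` is a hypothetical stand-in for real training runs.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 200, 1000

# Hidden "true" per-example influences, only used here to simulate training runs.
true_influence = rng.normal(size=n_train)

def train_and_eval(mask: np.ndarray) -> float:
    # Stand-in for: train on the subset `mask` selects, evaluate on one test point.
    return mask @ true_influence + rng.normal(scale=0.1)

# Sample random 50% subsets and record the resulting model outputs.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)
outputs = np.array([train_and_eval(m) for m in masks])

# Fit the linear surrogate: outputs ~= masks @ theta + b.
X = np.hstack([masks, np.ones((n_subsets, 1))])
theta, *_ = np.linalg.lstsq(X, outputs, rcond=None)

# theta[:-1] estimates each training point's influence on this test example.
print("correlation with true influence:",
      np.corrcoef(theta[:-1], true_influence)[0, 1])
```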
Incredible research: not only do they show mechanistic reasons why Whale's NSA and Kimi's MoBA have greater capacity for length extrapolation, but they also show how NSA can be pruned for even higher throughput! This is the true interpretability🤝capabilities moment. Read.
~7/8~ We analyzed the gating distributions for NSA models and found we can ablate many branches without compromising model performance! Our principled ablations enabled massive gains in throughput without losses in performance.
Some fun vibe-coded Gemini projects 👇 (play with them yourself in the final post)
New paper on the generalization of Flow Matching arxiv.org/abs/2506.03719 🤯 Why does flow matching generalize? Did you know that the flow matching target you're trying to learn **can only generate training points**? with @Qu3ntinB, Anne Gagneux & Rémi Emonet 👇👇👇
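A quick sketch of why that statement holds (standard conditional flow matching with an empirical training set, my notation rather than the paper's): the marginal target velocity is a posterior-weighted average of the per-example conditional velocities, so the exact target field transports every noise sample onto a training point, and any generalization has to come from how the learned network smooths this field.

```latex
% Empirical data distribution over training points x^{(1)}, \dots, x^{(n)};
% p_t(x \mid x^{(i)}) is the conditional probability path, u_t(x \mid x^{(i)}) its velocity.
u_t^{\star}(x) \;=\; \sum_{i=1}^{n} u_t\!\left(x \mid x^{(i)}\right)
\frac{p_t\!\left(x \mid x^{(i)}\right)}{\sum_{j=1}^{n} p_t\!\left(x \mid x^{(j)}\right)}
```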
A horrible @airindia crash today, the first fatal incident ever for a 787 Dreamliner. Both the 787 and the 737 Max are part of the new "problem generation" of Boeing aircraft starting around the mid-2000s. Like the 737 Max, the Dreamliner has had lots of manufacturing/safety…
The horrible Jeju Air crash shows how random air crashes can be: before this crash, Jeju Air had an almost flawless safety record. I flew with them lots of times, to and from Jeju Island (which the airline is named after, and which is like the Hawaii of South Korea). But also from/to…
My sleep scores during recent travel were in the 90s. Now back in SF I am consistently back down to the 70s and 80s. I am increasingly convinced that this is due to traffic noise from a nearby road/intersection where I live: every ~10 min, a car, truck, bus, or motorcycle with a very…
This is a modern ViT I can get behind: NaFlexViT! I would also throw in registers and then call it a day :) Also agree with Ross that the model code is easy but the data pipeline part is the annoying one.
timm's got a new vision transformer (NaFlexVit), and it's flexible! I've been plugging away at this for a bit, integrating ideas from FlexiViT, NaViT, and NaFlex and finally ready to merge for initial exploration. The model supports: * variable aspect/size images of NaFlex (see…
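Once the merge lands you should be able to poke at it through the usual timm entry points; a small sketch (the "*naflex*" name filter is my guess at the final model names):

```python
# Sketch: discover and instantiate the new flexible ViT variants via timm's
# standard API. The "*naflex*" pattern is a guess at the merged model names.
import timm

names = timm.list_models("*naflex*")
print(names)

if names:
    model = timm.create_model(names[0], pretrained=False)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{names[0]}: {n_params:.1f}M params")
```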
State space models and RNNs compress history into a constant-size state, while attention has a KV cache that scales linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also a beautiful connection to classical…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
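Not the paper's construction, but here's a toy of the "state grows logarithmically" idea: keep linear-attention-style chunk states (sums of k vᵀ outer products) whose sizes follow a binary-counter merge pattern, so after T tokens only O(log T) states remain.

```python
# Toy sketch of logarithmically growing state (NOT the Log-Linear Attention
# algorithm): chunk states merge like carries in a binary counter, so the number
# of kept states is popcount(T) <= log2(T) + 1.
import torch

d = 16
states: list[tuple[int, torch.Tensor]] = []  # (tokens summarized, S = sum_i k_i v_i^T)

def append_token(k: torch.Tensor, v: torch.Tensor) -> None:
    states.append((1, torch.outer(k, v)))
    while len(states) >= 2 and states[-1][0] == states[-2][0]:
        (n2, s2), (n1, s1) = states.pop(), states.pop()
        states.append((n1 + n2, s1 + s2))

def read(q: torch.Tensor) -> torch.Tensor:
    # Linear-attention style read over all chunk states (a real method would
    # weight chunks, e.g. by recency, rather than summing them uniformly).
    return sum(q @ s for _, s in states)

torch.manual_seed(0)
T = 1000
for _ in range(T):
    append_token(torch.randn(d), torch.randn(d))
print("tokens:", T, "| chunk states kept:", len(states))  # 6 states for T = 1000
print(read(torch.randn(d)).shape)  # torch.Size([16])
```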
A video generator must satisfy 3 criteria to be a world model: 1️⃣ Causality: Past affects future, not vice versa. 2️⃣ Persistence: The world shouldn't change because you looked away. 3️⃣ Constant Speed: Simulation shouldn't slow down over time. We believe SSMs are a natural fit:…
Long-Context State-Space Video World Models: "we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency."
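The "constant speed" criterion is just the SSM recurrence at work: the state is a fixed-size tensor and each new frame costs the same whether it's frame 10 or frame 10,000, unlike an attention KV cache that keeps growing. A minimal (diagonal, untrained) sketch, purely illustrative and not the paper's architecture:

```python
# Minimal diagonal SSM step: fixed-size state, constant per-frame cost.
import torch

d_state, d_in = 256, 64
A = torch.rand(d_state) * 0.99           # stable diagonal transition
B = torch.randn(d_state, d_in) * 0.01
C = torch.randn(d_in, d_state) * 0.01

h = torch.zeros(d_state)                 # the entire rollout memory
for t in range(10_000):                  # roll out arbitrarily long
    x_t = torch.randn(d_in)              # stand-in for an encoded video frame
    h = A * h + B @ x_t                  # O(d_state * d_in) per step, independent of t
    y_t = C @ h                          # output for this frame
print(h.shape)                           # state never grows: torch.Size([256])
```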
As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
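Concretely (this is the standard KL-regularized RL result, not anything paper-specific): the optimum of the reverse-KL-regularized objective is an exponential reweighting of the base policy, so anything the base policy gives zero probability stays at zero probability.

```latex
\pi^{\star} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{a \sim \pi}\!\left[ r(a) \right]
\;-\; \beta\,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)
\quad\Longrightarrow\quad
\pi^{\star}(a) \;\propto\; \pi_{\mathrm{ref}}(a)\,\exp\!\left( r(a)/\beta \right),
\qquad
\pi_{\mathrm{ref}}(a) = 0 \;\Rightarrow\; \pi^{\star}(a) = 0.
```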
worth reading. i love the @CerebrasSystems folks' recent work so much! here is my summary after a quick skim. main RQ: “how should we scale bsz or lr given N and D, and how should we adjust other HPs like weight decay (wd) accordingly?” they focused on wd and bsz this time. (1/n)
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
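For context on why λ, η, B, and D end up coupled at all (the standard framing this line of work builds on, as I understand it): with decoupled AdamW weight decay the weights behave like an EMA of preconditioned updates, and the EMA timescale is where all four quantities meet.

```latex
% Decoupled (AdamW) weight decay shrinks the weights directly at each step:
\theta_{t+1} \;=\; (1 - \eta\lambda)\,\theta_t \;-\; \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
% so the weights are roughly an EMA of updates with timescale
\tau_{\text{iter}} \;\approx\; \frac{1}{\eta\lambda}\ \text{steps}
\;\;\Longleftrightarrow\;\;
\tau_{\text{epoch}} \;\approx\; \frac{B}{\eta\lambda D}\ \text{epochs (batch size } B,\ \text{dataset size } D\text{)}.
```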
bytedance bros... stop... i can't follow... arxiv.org/abs/2505.15270 scaling diffusion with muP @cloneofsimo
ByteDance released a 37-page report on training a Gemini-like native multimodal model! The most interesting part imo is the "Integrated Transformer" architecture, where the same backbone acts both as a GPT-like autoregressive model and as a DiT-style diffusion model.
excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing. embeddings from different models are SO similar that we can map between them based on structure alone, without *any* paired data. feels like magic, but it's real: 🧵
this is sick. all i'll say is that these GIFs are proof that the biggest bet of my research career is gonna pay off. excited to say more soon
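One cheap way to get a feel for the "structure alone" claim (a sanity check of my own, not the paper's unsupervised mapping method): embed the same sentences with two different models and correlate their pairwise cosine-similarity matrices; the relational structure tends to line up remarkably well. Model names below are just examples.

```python
# Sketch: compare the *relational* structure of two embedding models by
# correlating their pairwise cosine-similarity matrices.
import numpy as np
from sentence_transformers import SentenceTransformer

sentences = [
    "The cat sat on the mat.",
    "A dog barked at the mailman.",
    "Stock markets fell sharply today.",
    "The theorem follows from the lemma.",
    "She brewed a cup of green tea.",
]

def sim_matrix(model_name: str) -> np.ndarray:
    emb = SentenceTransformer(model_name).encode(sentences, normalize_embeddings=True)
    return emb @ emb.T  # pairwise cosine similarities

A = sim_matrix("all-MiniLM-L6-v2")
B = sim_matrix("all-mpnet-base-v2")

# Correlate the off-diagonal entries of the two similarity matrices.
mask = ~np.eye(len(sentences), dtype=bool)
print("structural correlation:", np.corrcoef(A[mask], B[mask])[0, 1])
```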
They study model merging (EMA, soups), which is well-understood for fine-tunings of a base model, but they investigate *pre-training* of LLMs. My summary thread. (YoU wOn'T bEliEvE the surprise in post #5 which makes me like this group a lot more!)
Model Merging in Pre-training of Large Language Models: "We present the Pre-trained Model Averaging (PMA) strategy, a novel framework for model merging during LLM pre-training. Through extensive experiments across model scales (from millions to over 100B parameters), we…
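The operation being studied is simple at its core; a sketch of parameter-wise checkpoint averaging (the generic "soup" operation over placeholder checkpoint paths, not necessarily PMA's exact weighting or schedule):

```python
# Sketch: parameter-wise averaging of checkpoints of the same architecture.
# Uniform weights here; PMA / EMA variants would change the weighting scheme.
# Checkpoint paths are placeholders.
import torch

ckpt_paths = ["step_10000.pt", "step_11000.pt", "step_12000.pt"]
w = 1.0 / len(ckpt_paths)

avg_state = {}
first = torch.load(ckpt_paths[0], map_location="cpu")
for k, v in first.items():
    # Average float tensors; keep integer buffers (e.g. step counters) from the first ckpt.
    avg_state[k] = w * v.float() if torch.is_floating_point(v) else v.clone()

for path in ckpt_paths[1:]:
    state = torch.load(path, map_location="cpu")
    for k, v in state.items():
        if torch.is_floating_point(v):
            avg_state[k] += w * v.float()

torch.save(avg_state, "merged.pt")  # load_state_dict() into the model class as usual
```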
I'm sick of all the software slop. Millions of lines, and 90% of them do not express any ideas. They simply exist to support the other millions of lines. Dead weight abstraction layers that are too costly to refactor away. Plumbing linking old systems to newer systems to the new…
here is unambiguous proof that @MistralAI trains on the test set! GitHub NIAH test vs. custom NIAH (procedurally generated facts and questions, instead of the exact strings from the repo)
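For reference, a "procedurally generated" NIAH probe looks something like this: invent a fresh random fact, bury it at a chosen depth in filler text, and ask for it back, so the answer cannot have been memorized from any repo or benchmark. A minimal sketch; the fact and filler templates are mine:

```python
# Sketch of a procedurally generated needle-in-a-haystack (NIAH) probe:
# the fact is random, so it cannot have been seen during training.
import random
import string

def make_niah_prompt(context_tokens: int = 2000, depth: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    value = rng.randint(100_000, 999_999)
    needle = f"The secret code for project {key} is {value}."

    filler_sentence = "The sky was clear and the meeting ran long. "
    filler = [filler_sentence] * (context_tokens // len(filler_sentence.split()))
    insert_at = int(depth * len(filler))
    haystack = "".join(filler[:insert_at]) + needle + " " + "".join(filler[insert_at:])

    question = f"What is the secret code for project {key}? Answer with the number only."
    return haystack + "\n\n" + question, str(value)

prompt, answer = make_niah_prompt()
print(prompt[-200:])
print("expected answer:", answer)
```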
the closest thing i've seen to actual "physics for LMs" was probably this (single-author!) paper from NeurIPS 2024: Understanding Transformers via N-Gram Statistics. this is how we used to think about LMs; not sure why we stopped.
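The spirit of that analysis is easy to reproduce on a toy scale: build an n-gram next-token predictor from a corpus and check how often a transformer's argmax prediction agrees with it. A rough sketch (the corpus, model choice, and "agreement" metric are all simplifications of what the paper actually measures):

```python
# Rough sketch: how often does a causal LM's argmax next-token prediction match
# a bigram predictor built from the same text?
from collections import Counter, defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
text = "the cat sat on the mat . the dog sat on the rug . " * 50  # toy corpus

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
ids = tok(text, return_tensors="pt").input_ids[0]

# Bigram table: most frequent successor of each token in this corpus.
succ = defaultdict(Counter)
for a, b in zip(ids.tolist(), ids[1:].tolist()):
    succ[a][b] += 1
bigram_pred = {a: c.most_common(1)[0][0] for a, c in succ.items()}

with torch.no_grad():
    lm_pred = model(ids.unsqueeze(0)).logits[0].argmax(-1)  # prediction for position t+1

matches = sum(
    int(lm_pred[t].item() == bigram_pred[ids[t].item()])
    for t in range(len(ids) - 1)
)
print(f"LM argmax agrees with corpus bigram on {matches / (len(ids) - 1):.1%} of positions")
```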
The VPN industry makes $70 billion per year and is worth almost a trillion dollars, so this tweet will get a lot of pushback. If you don't believe me, ask any security researcher with credentials and they'll probably mostly agree. The big VPN companies have used lots of FUD to…
HTTPS + custom DNS set to 8.8.8.8 or 1.1.1.1 and your traffic stays private. Your DNS lookups (i.e. which sites you visit) then go to Google or Cloudflare instead of your ISP, and HTTPS encrypts the traffic end-to-end. No need for an expensive VPN subscription in most cases. Don't fall for their FUD.
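Mechanically, "set custom DNS" just means your name lookups go to the resolver you choose rather than your ISP's; a tiny sketch with the dnspython package (the library choice and domain are mine):

```python
# Sketch: point lookups at a resolver you choose (Cloudflare/Google here)
# instead of the system/ISP default. HTTPS then encrypts the page traffic itself.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)  # ignore the system's resolvers
resolver.nameservers = ["1.1.1.1", "8.8.8.8"]

for rr in resolver.resolve("example.com", "A"):
    print("example.com ->", rr.to_text())
```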