An Yan
@AnYan_ai
@SFResearch Prev: @UCSanDiego @Microsoft @Meta @Adobe. Working on Vision-Language.
i learned about this in a recent project and had to switch back from vLLM to HF (and eat a ~5x slowdown) just so my results would be consistent. please spread and help a fellow researcher out 🙏 e.g. github.com/vllm-project/v… github.com/vllm-project/v… github.com/vllm-project/v… ...
horrifying bug of the day is finding out that vLLM and HuggingFace produce significantly different logprobs discuss.vllm.ai/t/numerical-di…
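If you want to check the gap on your own model, here is a minimal sketch comparing teacher-forced prompt logprobs from HF transformers vs. vLLM (the model name, prompt, and the exact vLLM result fields are my assumptions; the prompt_logprobs layout can differ between vLLM versions):

```python
# Sketch: compare per-token prompt logprobs from HF transformers vs. vLLM.
# Assumes a small causal LM ("gpt2"); the vLLM result layout (prompt_logprobs as a
# list of {token_id: Logprob} dicts, first entry None) may vary by version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "gpt2"
PROMPT = "The quick brown fox jumps over the lazy dog."

# HF reference: teacher-forced logprob of each prompt token given its prefix.
tok = AutoTokenizer.from_pretrained(MODEL)
hf = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32).eval()
ids = tok(PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = torch.log_softmax(hf(ids).logits.float(), dim=-1)
hf_lp = logprobs[0, :-1].gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

# vLLM: request logprobs for the prompt tokens themselves (no real generation).
llm = LLM(model=MODEL)
out = llm.generate([PROMPT], SamplingParams(max_tokens=1, prompt_logprobs=0))[0]
vllm_lp = torch.tensor(
    [d[t].logprob for d, t in zip(out.prompt_logprobs[1:], ids[0, 1:].tolist())]
)

print("max abs logprob diff:", (hf_lp - vllm_lp).abs().max().item())
```

Small numerical differences from different kernels/dtypes are expected; the scary case is when they're large enough to change downstream results.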
“How will my model behave if I change the training data?” Recent(-ish) work w/ @logan_engstrom: we nearly *perfectly* predict ML model behavior as a function of training data, saturating benchmarks for this problem (called “data attribution”).
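For anyone new to the term, "data attribution" is usually set up roughly like this (a common formulation, e.g. datamodels-style): encode each training subset as a 0/1 inclusion vector and fit a simple, often linear, surrogate from that vector to the model's output on a fixed test example. A toy sketch with synthetic training runs; train_and_eval below is a hypothetical stand-in for actually retraining a model:

```python
# Toy sketch of the data-attribution setup: predict a model's output on a fixed
# test example as a (linear) function of which training points were included.
# `train_and_eval` is a hypothetical stand-in for real training runs.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 200, 1000

# Hidden "true" per-example influences, only used here to simulate training runs.
true_influence = rng.normal(size=n_train)

def train_and_eval(mask: np.ndarray) -> float:
    # Stand-in for: train on the subset `mask` selects, evaluate on one test point.
    return mask @ true_influence + rng.normal(scale=0.1)

# Sample random 50% subsets and record the resulting model outputs.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)
outputs = np.array([train_and_eval(m) for m in masks])

# Fit the linear surrogate: outputs ~= masks @ theta + b.
X = np.hstack([masks, np.ones((n_subsets, 1))])
theta, *_ = np.linalg.lstsq(X, outputs, rcond=None)

# theta[:-1] estimates each training point's influence on this test example.
print("correlation with true influence:",
      np.corrcoef(theta[:-1], true_influence)[0, 1])
```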
Incredible research: not only do they show mechanistic reasons why Whale's NSA and Kimi's MoBA have greater capacity for length extrapolation, but they also show how NSA can be pruned for even higher throughput! This is the true interpretability🤝capabilities moment. Read.
~7/8~ We analyzed the gating distributions for NSA models and found we can ablate many branches without compromising model performance! Our principled ablations enabled massive gains in throughput without losses in performance.
Some fun vibe-coded Gemini projects 👇 (play with them yourself in the final post)
New paper on the generalization of Flow Matching arxiv.org/abs/2506.03719 🤯 Why does flow matching generalize? Did you know that the flow matching target you're trying to learn **can only generate training points**? with @Qu3ntinB, Anne Gagneux & Rémi Emonet 👇👇👇
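A quick sketch of why that statement holds (standard conditional flow matching with an empirical training set, my notation rather than the paper's): the marginal target velocity is a posterior-weighted average of the per-example conditional velocities, so the exact target field transports every noise sample onto a training point, and any generalization has to come from how the learned network smooths this field.

```latex
% Empirical data distribution over training points x^{(1)}, \dots, x^{(n)};
% p_t(x \mid x^{(i)}) is the conditional probability path, u_t(x \mid x^{(i)}) its velocity.
u_t^{\star}(x) \;=\; \sum_{i=1}^{n} u_t\!\left(x \mid x^{(i)}\right)
\frac{p_t\!\left(x \mid x^{(i)}\right)}{\sum_{j=1}^{n} p_t\!\left(x \mid x^{(j)}\right)}
```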
A horrible @airindia crash today, the first fatal incident ever for a 787 Dreamliner. Both the 787 and the 737 Max are part of the new "problem generation" of Boeing aircraft starting around the mid-2000s. Like the 737 Max, the Dreamliner has had lots of manufacturing/safety…
The horrible Jeju Air crash shows how random air crashes can be: before this crash, Jeju Air had an almost flawless safety record. I flew with them lots of times, to and from Jeju Island (which the airline is named after, and which is like the Hawaii of South Korea). But also from/to…
My sleep scores during recent travel were in the 90s. Now back in SF I am consistently back down to the 70s and 80s. I am increasingly convinced that this is due to traffic noise from a nearby road/intersection where I live: every ~10 min, a car, truck, bus, or motorcycle with a very…
This is a modern ViT I can get behind: NaFlexViT! I would also throw in registers and then call it a day :) Also agree with Ross that the model code is easy but the data pipeline part is the annoying one.
timm's got a new vision transformer (NaFlexVit), and it's flexible! I've been plugging away at this for a bit, integrating ideas from FlexiViT, NaViT, and NaFlex and finally ready to merge for initial exploration. The model supports: * variable aspect/size images of NaFlex (see…
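Once the merge lands you should be able to poke at it through the usual timm entry points; a small sketch (the "*naflex*" name filter is my guess at the final model names):

```python
# Sketch: discover and instantiate the new flexible ViT variants via timm's
# standard API. The "*naflex*" pattern is a guess at the merged model names.
import timm

names = timm.list_models("*naflex*")
print(names)

if names:
    model = timm.create_model(names[0], pretrained=False)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{names[0]}: {n_params:.1f}M params")
```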
State space models and RNNs compress history into a constant-size state, while attention has a KV cache that scales linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also a beautiful connection to classical…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
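Not the paper's construction, but here's a toy of the "state grows logarithmically" idea: keep linear-attention-style chunk states (sums of k vᵀ outer products) whose sizes follow a binary-counter merge pattern, so after T tokens only O(log T) states remain.

```python
# Toy sketch of logarithmically growing state (NOT the Log-Linear Attention
# algorithm): chunk states merge like carries in a binary counter, so the number
# of kept states is popcount(T) <= log2(T) + 1.
import torch

d = 16
states: list[tuple[int, torch.Tensor]] = []  # (tokens summarized, S = sum_i k_i v_i^T)

def append_token(k: torch.Tensor, v: torch.Tensor) -> None:
    states.append((1, torch.outer(k, v)))
    while len(states) >= 2 and states[-1][0] == states[-2][0]:
        (n2, s2), (n1, s1) = states.pop(), states.pop()
        states.append((n1 + n2, s1 + s2))

def read(q: torch.Tensor) -> torch.Tensor:
    # Linear-attention style read over all chunk states (a real method would
    # weight chunks, e.g. by recency, rather than summing them uniformly).
    return sum(q @ s for _, s in states)

torch.manual_seed(0)
T = 1000
for _ in range(T):
    append_token(torch.randn(d), torch.randn(d))
print("tokens:", T, "| chunk states kept:", len(states))  # 6 states for T = 1000
print(read(torch.randn(d)).shape)  # torch.Size([16])
```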
A video generator must satisfy 3 criteria to be a world model: 1️⃣ Causality: Past affects future, not vice versa. 2️⃣ Persistence: The world shouldn't change because you looked away. 3️⃣ Constant Speed: Simulation shouldn't slow down over time. We believe SSMs are a natural fit:…
Long-Context State-Space Video World Models: "we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency."
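The "constant speed" criterion is just the SSM recurrence at work: the state is a fixed-size tensor and each new frame costs the same whether it's frame 10 or frame 10,000, unlike an attention KV cache that keeps growing. A minimal (diagonal, untrained) sketch, purely illustrative and not the paper's architecture:

```python
# Minimal diagonal SSM step: fixed-size state, constant per-frame cost.
import torch

d_state, d_in = 256, 64
A = torch.rand(d_state) * 0.99           # stable diagonal transition
B = torch.randn(d_state, d_in) * 0.01
C = torch.randn(d_in, d_state) * 0.01

h = torch.zeros(d_state)                 # the entire rollout memory
for t in range(10_000):                  # roll out arbitrarily long
    x_t = torch.randn(d_in)              # stand-in for an encoded video frame
    h = A * h + B @ x_t                  # O(d_state * d_in) per step, independent of t
    y_t = C @ h                          # output for this frame
print(h.shape)                           # state never grows: torch.Size([256])
```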
As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
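Concretely (this is the standard KL-regularized RL result, not anything paper-specific): the optimum of the reverse-KL-regularized objective is an exponential reweighting of the base policy, so anything the base policy gives zero probability stays at zero probability.

```latex
\pi^{\star} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{a \sim \pi}\!\left[ r(a) \right]
\;-\; \beta\,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)
\quad\Longrightarrow\quad
\pi^{\star}(a) \;\propto\; \pi_{\mathrm{ref}}(a)\,\exp\!\left( r(a)/\beta \right),
\qquad
\pi_{\mathrm{ref}}(a) = 0 \;\Rightarrow\; \pi^{\star}(a) = 0.
```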
worth reading. i love the @CerebrasSystems folks' recent work so much! here is my summary after a quick skim. main RQ: “how should we scale bsz or lr given N and D, and how should we adjust other HPs like weight decay (wd) accordingly?” they focused on wd and bsz this time. (1/n)
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
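For context on why λ, η, B, and D end up coupled at all (the standard framing this line of work builds on, as I understand it): with decoupled AdamW weight decay the weights behave like an EMA of preconditioned updates, and the EMA timescale is where all four quantities meet.

```latex
% Decoupled (AdamW) weight decay shrinks the weights directly at each step:
\theta_{t+1} \;=\; (1 - \eta\lambda)\,\theta_t \;-\; \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
% so the weights are roughly an EMA of updates with timescale
\tau_{\text{iter}} \;\approx\; \frac{1}{\eta\lambda}\ \text{steps}
\;\;\Longleftrightarrow\;\;
\tau_{\text{epoch}} \;\approx\; \frac{B}{\eta\lambda D}\ \text{epochs (batch size } B,\ \text{dataset size } D\text{)}.
```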
bytedance bros... stop... i can't follow... arxiv.org/abs/2505.15270 scaling diffusion with muP @cloneofsimo
ByteDance released a 37-page report on training a Gemini-like native multimodal model! The most interesting part imo is the "Integrated Transformer" architecture, where the same backbone acts both as a GPT-like autoregressive model and as a DiT-style diffusion model.
excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing. embeddings from different models are SO similar that we can map between them based on structure alone, without *any* paired data. feels like magic, but it's real: 🧵
this is sick. all i'll say is that these GIFs are proof that the biggest bet of my research career is gonna pay off. excited to say more soon
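One cheap way to get a feel for the "structure alone" claim (a sanity check of my own, not the paper's unsupervised mapping method): embed the same sentences with two different models and correlate their pairwise cosine-similarity matrices; the relational structure tends to line up remarkably well. Model names below are just examples.

```python
# Sketch: compare the *relational* structure of two embedding models by
# correlating their pairwise cosine-similarity matrices.
import numpy as np
from sentence_transformers import SentenceTransformer

sentences = [
    "The cat sat on the mat.",
    "A dog barked at the mailman.",
    "Stock markets fell sharply today.",
    "The theorem follows from the lemma.",
    "She brewed a cup of green tea.",
]

def sim_matrix(model_name: str) -> np.ndarray:
    emb = SentenceTransformer(model_name).encode(sentences, normalize_embeddings=True)
    return emb @ emb.T  # pairwise cosine similarities

A = sim_matrix("all-MiniLM-L6-v2")
B = sim_matrix("all-mpnet-base-v2")

# Correlate the off-diagonal entries of the two similarity matrices.
mask = ~np.eye(len(sentences), dtype=bool)
print("structural correlation:", np.corrcoef(A[mask], B[mask])[0, 1])
```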
They study model merging (EMA, soups), which is well-understood for fine-tunings of a base model, but they investigate *pre-training* of LLMs. My summary thread. (YoU wOn'T bEliEvE the surprise in post #5 which makes me like this group a lot more!)
Model Merging in Pre-training of Large Language Models: "We present the Pre-trained Model Averaging (PMA) strategy, a novel framework for model merging during LLM pre-training. Through extensive experiments across model scales (from millions to over 100B parameters), we…
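The operation being studied is simple at its core; a sketch of parameter-wise checkpoint averaging (the generic "soup" operation over placeholder checkpoint paths, not necessarily PMA's exact weighting or schedule):

```python
# Sketch: parameter-wise averaging of checkpoints of the same architecture.
# Uniform weights here; PMA / EMA variants would change the weighting scheme.
# Checkpoint paths are placeholders.
import torch

ckpt_paths = ["step_10000.pt", "step_11000.pt", "step_12000.pt"]
w = 1.0 / len(ckpt_paths)

avg_state = {}
first = torch.load(ckpt_paths[0], map_location="cpu")
for k, v in first.items():
    # Average float tensors; keep integer buffers (e.g. step counters) from the first ckpt.
    avg_state[k] = w * v.float() if torch.is_floating_point(v) else v.clone()

for path in ckpt_paths[1:]:
    state = torch.load(path, map_location="cpu")
    for k, v in state.items():
        if torch.is_floating_point(v):
            avg_state[k] += w * v.float()

torch.save(avg_state, "merged.pt")  # load_state_dict() into the model class as usual
```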
I'm sick of all the software slop. Millions of lines, and 90% of them do not express any ideas. They simply exist to support the other millions of lines. Dead weight abstraction layers that are too costly to refactor away. Plumbing linking old systems to newer systems to the new…
here is unambiguous proof that @MistralAI trains on the test set! GitHub NIAH test vs. custom NIAH (procedurally generated facts and questions, instead of the exact strings from the repo)
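For reference, a "procedurally generated" NIAH probe looks something like this: invent a fresh random fact, bury it at a chosen depth in filler text, and ask for it back, so the answer cannot have been memorized from any repo or benchmark. A minimal sketch; the fact and filler templates are mine:

```python
# Sketch of a procedurally generated needle-in-a-haystack (NIAH) probe:
# the fact is random, so it cannot have been seen during training.
import random
import string

def make_niah_prompt(context_tokens: int = 2000, depth: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    value = rng.randint(100_000, 999_999)
    needle = f"The secret code for project {key} is {value}."

    filler_sentence = "The sky was clear and the meeting ran long. "
    filler = [filler_sentence] * (context_tokens // len(filler_sentence.split()))
    insert_at = int(depth * len(filler))
    haystack = "".join(filler[:insert_at]) + needle + " " + "".join(filler[insert_at:])

    question = f"What is the secret code for project {key}? Answer with the number only."
    return haystack + "\n\n" + question, str(value)

prompt, answer = make_niah_prompt()
print(prompt[-200:])
print("expected answer:", answer)
```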
the closest thing i've seen to actual "physics for LMs" was probably this (single-author!) paper from NeurIPS 2024: Understanding Transformers via N-Gram Statistics. this is how we used to think about LMs; not sure why we stopped.
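The spirit of that analysis is easy to reproduce on a toy scale: build an n-gram next-token predictor from a corpus and check how often a transformer's argmax prediction agrees with it. A rough sketch (the corpus, model choice, and "agreement" metric are all simplifications of what the paper actually measures):

```python
# Rough sketch: how often does a causal LM's argmax next-token prediction match
# a bigram predictor built from the same text?
from collections import Counter, defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
text = "the cat sat on the mat . the dog sat on the rug . " * 50  # toy corpus

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
ids = tok(text, return_tensors="pt").input_ids[0]

# Bigram table: most frequent successor of each token in this corpus.
succ = defaultdict(Counter)
for a, b in zip(ids.tolist(), ids[1:].tolist()):
    succ[a][b] += 1
bigram_pred = {a: c.most_common(1)[0][0] for a, c in succ.items()}

with torch.no_grad():
    lm_pred = model(ids.unsqueeze(0)).logits[0].argmax(-1)  # prediction for position t+1

matches = sum(
    int(lm_pred[t].item() == bigram_pred[ids[t].item()])
    for t in range(len(ids) - 1)
)
print(f"LM argmax agrees with corpus bigram on {matches / (len(ids) - 1):.1%} of positions")
```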
The VPN industry makes $70 billion per year and is worth almost a trillion dollars, so this tweet will get a lot of pushback. If you don't believe me, ask any security researcher with credentials and they'll probably mostly agree. The big VPN companies have used lots of FUD to…
HTTPS + custom DNS set to 8.8.8.8 or 1.1.1.1 and your traffic stays private. Your DNS lookups (i.e. which sites you visit) then go to Google or Cloudflare instead of your ISP, and HTTPS encrypts the traffic end-to-end. No need for an expensive VPN subscription in most cases. Don't fall for their FUD.
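Mechanically, "set custom DNS" just means your name lookups go to the resolver you choose rather than your ISP's; a tiny sketch with the dnspython package (the library choice and domain are mine):

```python
# Sketch: point lookups at a resolver you choose (Cloudflare/Google here)
# instead of the system/ISP default. HTTPS then encrypts the page traffic itself.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)  # ignore the system's resolvers
resolver.nameservers = ["1.1.1.1", "8.8.8.8"]

for rr in resolver.resolve("example.com", "A"):
    print("example.com ->", rr.to_text())
```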