Tianyuan Zhang
@tianyuanzhang99
PhDing at @MIT, working towards general intelligence and lifelong machine learning. M.S. from CMU, B.S. from PKU.
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch…
Model and training code for LaCT for language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation with sequence-parallel support. Both object-centric and scene-level view synthesis checkpoints are released 🤓— come play!
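Not the released code — just a minimal PyTorch sketch of the large-chunk TTT idea as I read it from the announcement. Every name below (`lact_layer`, the chunk size, the squared-error update) is my own illustrative choice, and the real LaCT memory is a nonlinear network where this sketch uses a linear fast-weight matrix for brevity:

```python
import torch

# Illustrative large-chunk TTT: a fast-weight matrix W acts as the memory.
# Once per (large) chunk we read with the current W, then update W by a few
# gradient steps so that k @ W regresses v on that chunk.
def lact_layer(keys, values, queries, chunk=2048, lr=1e-2, steps=1):
    d_k, d_v = keys.shape[-1], values.shape[-1]
    W = torch.zeros(d_k, d_v, requires_grad=True)   # fast weights (the memory)
    opt = torch.optim.SGD([W], lr=lr)
    outs = []
    for s in range(0, keys.shape[0], chunk):
        k, v, q = keys[s:s+chunk], values[s:s+chunk], queries[s:s+chunk]
        outs.append(q @ W.detach())                 # read with current memory
        for _ in range(steps):                      # test-time update on the chunk
            loss = ((k @ W - v) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat(outs)
```

The large chunk is the point: one weight update amortized over thousands of tokens keeps the nonlinear memory hardware-friendly.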
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
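A back-of-the-envelope reading of that objective; `encode`, `decode`, and `error` are placeholder callables, and the real KARL learns to hit the budget rather than scanning for it:

```python
# Placeholder sketch of the stated objective: the smallest token count
# t <= T whose reconstruction error stays within eps. The brute-force
# scan here is only for intuition about what is being approximated.
def smallest_budget(image, encode, decode, error, T, eps):
    for t in range(1, T + 1):
        if error(decode(encode(image, budget=t)), image) <= eps:
            return t
    return T            # fall back to the full budget
```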
I feel we need both. Compression and sparsity are orthogonal, and sometimes even opposites.
Take a look at this blog introducing sparse attention and its implementation, which I currently find more promising than compression-based methods for long-context modeling.
Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust…
🚀 Introducing UniRelight, a general-purpose relighting framework powered by video diffusion models. 🌟UniRelight jointly models the distribution of scene intrinsics and illumination, enabling high-quality relighting and intrinsic decomposition from a single image or video.
"Generalization means being able to solve problems that the system hasn't been prepared for." Our latest work in #RSS2025 can automatically invent neural networks as state abstractions, which help robots generalize. Check it out here: jaraxxus-me.github.io/IVNTR/
Thanks to Songlin and Xinyu for hosting. Here are the recording and slides.
Recording: youtube.com/watch?v=5QxQUr… Slides: asap-seminar.github.io/assets/slides/…
Happening in 5 min
Test-time training (TTT) is an elegant framework for adapting model weights to the context. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
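A hedged sketch of that recipe as described in the tweet; `init_kv_cache`, `generate_frame`, and the loss are hypothetical names, not the released API:

```python
# Sketch only: train by rolling the model out exactly as at inference,
# reusing the KV cache, then scoring the self-generated video.
def self_forcing_step(model, cond, num_frames, loss_fn, target):
    cache, frames = model.init_kv_cache(), []       # hypothetical API
    for _ in range(num_frames):
        frame, cache = model.generate_frame(cond, frames, cache)
        frames.append(frame)                        # condition on own outputs
    return loss_fn(frames, target)                  # e.g., a distribution-matching loss
```

Training on the model's own rollouts closes the train/inference gap that usually degrades autoregressive generation quality.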
Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
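A toy reading of the "in between" — my own Fenwick-tree-style illustration, not the paper's Triton kernels, and within-chunk attention is omitted for brevity:

```python
import torch

# Toy sketch: keep one linear-attention state per dyadic span of the past,
# so each chunk reads from O(log T) span summaries rather than from a
# single constant-size recurrent state.
def log_linear_process(chunks):               # chunks: list of (q, k, v), each (n, d)
    stack, outputs = [], []                   # stack holds (level, state) pairs
    for q, k, v in chunks:
        out = torch.zeros(q.shape[0], v.shape[-1])
        for _, s in stack:                    # read O(log T) span summaries
            out = out + q @ s
        outputs.append(out)
        level, state = 0, k.transpose(0, 1) @ v    # summarize this chunk as K^T V
        while stack and stack[-1][0] == level:     # merge equal-level spans
            _, top = stack.pop()
            state, level = state + top, level + 1
        stack.append((level, state))
    return outputs
```

Because the stack never holds two states at the same level, its size stays logarithmic in sequence length — which is where the log-linear training cost and log-time decoding come from.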
Finally! We just released the models and code for PS3 & VILA-HD, a vision encoder **pre-trained at 4K resolution** and the resulting MLLM! PS3 & VILA-HD models: huggingface.co/collections/nv… PS3 code: github.com/NVlabs/PS3 VILA-HD code: github.com/NVlabs/VILA/tr… Demo:…
Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we…