Tianyuan Zhang
@tianyuanzhang99
PhDing at @MIT, working towards general intelligence and lifelong machine learning. M.S. from CMU, B.S. from PKU.
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch…
Model and training code for LaCT for language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation with sequence-parallel support. Both object-centric and scene-level view synthesis checkpoints are released 🤓— come play!
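Not the released code — just a minimal PyTorch sketch of the large-chunk TTT idea as I read it from the announcement. Every name below (`lact_layer`, the chunk size, the squared-error update) is my own illustrative choice, and the real LaCT memory is a nonlinear network where this sketch uses a linear fast-weight matrix for brevity:

```python
import torch

# Illustrative large-chunk TTT: a fast-weight matrix W acts as the memory.
# Once per (large) chunk we read with the current W, then update W by a few
# gradient steps so that k @ W regresses v on that chunk.
def lact_layer(keys, values, queries, chunk=2048, lr=1e-2, steps=1):
    d_k, d_v = keys.shape[-1], values.shape[-1]
    W = torch.zeros(d_k, d_v, requires_grad=True)   # fast weights (the memory)
    opt = torch.optim.SGD([W], lr=lr)
    outs = []
    for s in range(0, keys.shape[0], chunk):
        k, v, q = keys[s:s+chunk], values[s:s+chunk], queries[s:s+chunk]
        outs.append(q @ W.detach())                 # read with current memory
        for _ in range(steps):                      # test-time update on the chunk
            loss = ((k @ W - v) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat(outs)
```

The large chunk is the point: one weight update amortized over thousands of tokens keeps the nonlinear memory hardware-friendly.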
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T that reconstructs the image within 𝜖 🧵
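A back-of-the-envelope reading of that objective; `encode`, `decode`, and `error` are placeholder callables, and the real KARL learns to hit the budget rather than scanning for it:

```python
# Placeholder sketch of the stated objective: the smallest token count
# t <= T whose reconstruction error stays within eps. The brute-force
# scan here is only for intuition about what is being approximated.
def smallest_budget(image, encode, decode, error, T, eps):
    for t in range(1, T + 1):
        if error(decode(encode(image, budget=t)), image) <= eps:
            return t
    return T            # fall back to the full budget
```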
I feel we need both. Compression and sparsity are orthogonal, and sometimes even opposites.
Take a look at this blog introducing sparse attention and its implementation, which I currently find more promising than compression-based methods for long-context modeling.
Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust…
🚀 Introducing UniRelight, a general-purpose relighting framework powered by video diffusion models. 🌟UniRelight jointly models the distribution of scene intrinsics and illumination, enabling high-quality relighting and intrinsic decomposition from a single image or video.
"Generalization means being able to solve problems that the system hasn't been prepared for." Our latest work in #RSS2025 can automatically invent neural networks as state abstractions, which help robots generalize. Check it out here: jaraxxus-me.github.io/IVNTR/
Thanks to Songlin and Xinyu for hosting. Here are the recording and slides.
Recording: youtube.com/watch?v=5QxQUr… Slides: asap-seminar.github.io/assets/slides/…
Happening in 5 min
Test-time training (TTT) is an elegant framework for adapting model weights to the context. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
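A hedged sketch of that recipe as described in the tweet; `init_kv_cache`, `generate_frame`, and the loss are hypothetical names, not the released API:

```python
# Sketch only: train by rolling the model out exactly as at inference,
# reusing the KV cache, then scoring the self-generated video.
def self_forcing_step(model, cond, num_frames, loss_fn, target):
    cache, frames = model.init_kv_cache(), []       # hypothetical API
    for _ in range(num_frames):
        frame, cache = model.generate_frame(cond, frames, cache)
        frames.append(frame)                        # condition on own outputs
    return loss_fn(frames, target)                  # e.g., a distribution-matching loss
```

Training on the model's own rollouts closes the train/inference gap that usually degrades autoregressive generation quality.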
Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
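A toy reading of the "in between" — my own Fenwick-tree-style illustration, not the paper's Triton kernels, and within-chunk attention is omitted for brevity:

```python
import torch

# Toy sketch: keep one linear-attention state per dyadic span of the past,
# so each chunk reads from O(log T) span summaries rather than from a
# single constant-size recurrent state.
def log_linear_process(chunks):               # chunks: list of (q, k, v), each (n, d)
    stack, outputs = [], []                   # stack holds (level, state) pairs
    for q, k, v in chunks:
        out = torch.zeros(q.shape[0], v.shape[-1])
        for _, s in stack:                    # read O(log T) span summaries
            out = out + q @ s
        outputs.append(out)
        level, state = 0, k.transpose(0, 1) @ v    # summarize this chunk as K^T V
        while stack and stack[-1][0] == level:     # merge equal-level spans
            _, top = stack.pop()
            state, level = state + top, level + 1
        stack.append((level, state))
    return outputs
```

Because the stack never holds two states at the same level, its size stays logarithmic in sequence length — which is where the log-linear training cost and log-time decoding come from.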
Finally! We just released the models and code for PS3 & VILA-HD, a vision encoder **pre-trained at 4K resolution** and the resulting MLLM! PS3 & VILA-HD models: huggingface.co/collections/nv… PS3 code: github.com/NVlabs/PS3 VILA-HD code: github.com/NVlabs/VILA/tr… Demo:…
Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we…