Albert Tseng
@tsengalb99
CS PhD Student @ Cornell
Excited to announce our #AISTATS 📜 on training LLMs with MXFP4! We use stoch. rounding and random Hadamard transforms (all fast on HW) to get low-variance, unbiased gradient estimates with MXFP4 GEMMs. We get a ~30% speedup over FP8 with almost no PPL gap! arxiv.org/abs/2502.20586
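A minimal sketch of the two ingredients named above, under my own assumptions: a generic sorted quantization grid stands in for MXFP4's shared-scale blocks, and stochastic_round / random_hadamard are illustrative names, not the paper's fused HW kernels.

import torch

def stochastic_round(x, grid):
    # grid: sorted 1-D tensor of representable values (same dtype/device as x).
    # Round each entry to one of its two nearest grid points with probability
    # proportional to proximity, so the rounding is unbiased: E[q] = x.
    x = x.clamp(min=float(grid[0]), max=float(grid[-1]))
    hi = torch.searchsorted(grid, x).clamp(max=grid.numel() - 1)
    lo = (hi - 1).clamp(min=0)
    g_lo, g_hi = grid[lo], grid[hi]
    p_hi = (x - g_lo) / (g_hi - g_lo).clamp(min=1e-12)
    return torch.where(torch.rand_like(x) < p_hi, g_hi, g_lo)

def random_hadamard(x, signs):
    # Orthonormal random Hadamard transform along the last dim (a power of two):
    # random sign flips followed by a fast Walsh-Hadamard transform. Spreading
    # outliers across the block is what keeps the quantization-error variance low.
    d = x.shape[-1]
    y = x * signs  # signs: random +/-1 vector of length d
    h = 1
    while h < d:
        y = y.reshape(*y.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], d)
        h *= 2
    return y / d ** 0.5

In an MX-style pipeline the rotation would be applied to GEMM operands before quantizing each block against a shared scale; stochastic rounding keeps the resulting gradient estimate unbiased, and the rotation keeps its variance low.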


❓ Are LLMs actually problem solvers or just good at regurgitating facts? 🚨 New Benchmark Alert! We built HeuriGym to benchmark whether LLMs can craft real heuristics for hard real-world combinatorial optimization problems. 🛞 We’re open-sourcing it all:
✅ 9 problems
✅ Iterative…
New paper: World models + Program synthesis by @topwasu
1. World modeling on-the-fly by synthesizing programs w/ 4000+ lines of code
2. Learns new environments from minutes of experience
3. Positive score on Montezuma's Revenge
4. Compositional generalization to new environments…
I will be at #CVPR2025 presenting our work on differential operators for hybrid neural fields! Catch me at our poster:
🗓️ Fri, June 13, 10:30 AM–12:30 PM
📍 ExHall D, Poster #34
🔗 cvpr.thecvf.com/virtual/2025/p…
Details below ⬇️
📢 Excited to share our latest work on computing accurate differential operators for hybrid neural fields (like Instant NGP)! 🔗: justachetan.github.io/hnf-derivative… 🧵👇🏻 (1/n)
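The thread doesn't spell out the method, so as context only: "differential operators for a neural field" refers to derivatives of the learned function taken at query points, and the naive baseline is plain autograd, as in this hedged sketch (not the paper's approach; field is any callable mapping points to scalar values).

import torch

def field_gradient(field, x):
    # Naive spatial gradient of a scalar neural field at query points x
    # (shape [n, 3]) via autograd; the work above is about computing such
    # operators accurately for hybrid fields like Instant NGP, and this is
    # only the baseline. create_graph=True would allow second-order
    # operators (e.g. a Laplacian) on top.
    x = x.clone().requires_grad_(True)
    y = field(x).sum()  # each output depends only on its own query point
    (grad,) = torch.autograd.grad(y, x, create_graph=True)
    return grad          # shape [n, 3]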
Check out CARTRIDGES, scaling cache-time compute! An alternative to ICL for settings where many different user messages reference the same large corpus of text!
When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x…
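Reading the posts above, the core object is a small trainable KV cache optimized offline with the base model frozen. A prefix-tuning-style sketch of that object (the shapes, names, and toy attention call are my own illustration, not the CARTRIDGES code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableKVCache(nn.Module):
    # A small set of learnable key/value vectors per layer that stands in for
    # the (much larger) KV cache of a long document.
    def __init__(self, n_layers, n_heads, cache_len, head_dim):
        super().__init__()
        shape = (n_layers, 1, n_heads, cache_len, head_dim)
        self.keys = nn.Parameter(0.02 * torch.randn(shape))
        self.values = nn.Parameter(0.02 * torch.randn(shape))

    def layer(self, i):
        return self.keys[i], self.values[i]

def attend_with_cache(q, k_new, v_new, k_cache, v_cache):
    # Queries from the user turn attend over [trained cache ; new tokens],
    # exactly as they would over a real document KV cache.
    b = q.shape[0]
    k = torch.cat([k_cache.expand(b, -1, -1, -1), k_new], dim=2)
    v = torch.cat([v_cache.expand(b, -1, -1, -1), v_new], dim=2)
    return F.scaled_dot_product_attention(q, k, v)

Offline, a self-study-style recipe would freeze the model and train only these key/value parameters (e.g. on generated questions about the corpus, matching the responses the model gives with the full document in context); at serving time the trained cache is loaded instead of re-prefilling the whole corpus.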
Albert and co. continue to do excellent work on quantization. This time the trick is to minimize KL w.r.t. the original model, with a clever Hessian factorization.
📣Introducing our latest work: Yet Another Quantization Algorithm! YAQA directly minimizes the KL divergence to the original model during rounding, cutting it by >30% over prior PTQ methods and giving an even closer model than Google’s QAT on Gemma! 🤯 arxiv.org/abs/2505.22988👇
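One way to read "minimizing the KL during rounding": round each layer against a quadratic proxy (w - q)^T H (w - q), where H approximates how sensitive the KL to the original model is to that layer's weights; the posts above say YAQA builds H from a factorized Hessian estimate. Below is only the generic error-compensated rounding idea in the style of OBQ/GPTQ, with illustrative names and a plain O(d^2) loop, not YAQA's actual estimator or rounding rule.

import torch

def round_row_against_proxy(w, H_inv, grid):
    # Greedily round one weight row w (shape [d]) so the quadratic proxy
    # (w - q)^T H (w - q) stays small. H_inv is the inverse of the proxy
    # Hessian H; grid is a sorted 1-D tensor of representable values.
    w = w.clone()
    q = torch.empty_like(w)
    d = w.numel()
    for j in range(d):
        # nearest grid point for the current (already-compensated) weight
        q[j] = grid[torch.argmin((grid - w[j]).abs())]
        # push the rounding error onto the not-yet-rounded weights
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]
    return q

With H = I this degenerates to round-to-nearest; the better H captures which directions the original model's outputs are sensitive to, the smaller the KL gap after rounding.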
Apparently I chose the worst day to release a paper, so ICYMI, we made a post-training quantization algorithm that outperforms even @Google's quantization-aware training recipe. We beat the prior SOTA by >30%, meaning faster and smaller models. More details in the original 🧵👇
VideoPrism is now available at: github.com/google-deepmin… :)
Introducing VideoPrism, a single model for general-purpose video understanding that can handle a wide range of tasks, including classification, localization, retrieval, captioning and question answering. Learn how it works at goo.gle/49ltEXW
5/ Quantized models don't need to lose fidelity. Check out our paper and blog for details:
📝 Paper: arxiv.org/abs/2505.22988
📖 Blog: together.ai/blog/yaqa
💻 Code: github.com/Cornell-RelaxM…
Chipmunk is up on arXiv! Across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps. Caching + sparsity speeds up generation by only recomputing fast-changing activations.
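A toy illustration of the caching + sparsity idea for a token-wise (MLP-like) block; the names and the per-token granularity are mine, and the real kernels operate on sparse intermediate values inside the attention and MLP GEMMs themselves.

import torch

def delta_sparse_block(x_prev, x_cur, out_prev, block_fn, frac=0.2):
    # x_prev / x_cur: (tokens, dim) inputs to the block at consecutive diffusion
    # steps; out_prev: cached output from the previous step. Recompute block_fn
    # only for the `frac` of tokens whose inputs drifted the most and reuse the
    # cached output for everything else.
    drift = (x_cur - x_prev).norm(dim=-1)
    k = max(1, int(frac * x_cur.shape[0]))
    hot = drift.topk(k).indices          # the fast-changing tokens
    out = out_prev.clone()
    out[hot] = block_fn(x_cur[hot])      # sparse recompute
    return out

Because only a small slice of values (per the numbers above, 5-25%) accounts for most of the step-to-step change, most of the cached output can be reused, which is where the speedup comes from.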
🚀 New research: YAQA — Yet Another Quantization Algorithm (yes, pronounced like yaca/jackfruit 🥭) Led by @tsengalb99, YAQA minimizes the KL divergence to the original model during quantization, cutting it by >30% vs. prior methods and outperforming even QAT on Gemma 3. 👇
🔥Thrilled to share that I’ll be joining the Computer Science Department at NYU Shanghai as an Assistant Professor starting Fall 2025! @nyushanghai 🎯 I’ll be recruiting PhD students across the entire NYU network—including @nyushanghai, @nyutandon, and @NYU_Courant—to build…