Sai Surya Duvvuri
@dvsaisurya
Visiting Researcher at FAIR, Meta and CS PhD student at UT Austin. Previously, SR at Google | Pre-Doctoral Research Fellow at MSR India | CS UG at IIT KGP
📢 Thrilled to share our new paper, LASER: Attention with Exponential Transformation, accepted at ICML2025, work done at Google. Come by our poster presentation! 🗓️ Thurs, July 17th, 4:30-7pm 📍 West Exhibition Hall B2-B3, # W-915 Read the full paper here: arxiv.org/abs/2411.03493
New AI model tweak could change how Transformers read text. Instead of just comparing pairs of words, this new approach looks at triples—capturing more context from each token. Here’s how triplet attention, called 2‑simplicial, could make models smarter while keeping data costs…
Top-k greedy inference for diffusion models can unlock better accuracy. Wondering whether finding the optimal order in which to unmask tokens can be automated across prompts/tasks.
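A rough sketch of what top-k greedy unmasking could look like, assuming a hypothetical `model` that returns per-position token logits for a partially masked sequence and a placeholder `MASK_ID`; this illustrates the confidence-ordered decoding idea, not any paper's exact sampler:

```python
import torch

MASK_ID = 0  # hypothetical mask token id

@torch.no_grad()
def topk_greedy_unmask(model, tokens, k=1):
    """Greedy top-k decoding sketch for a masked diffusion model.

    `model(tokens)` is assumed to return logits of shape
    (seq_len, vocab_size) for a partially masked sequence.
    At each step, unmask the k masked positions where the model
    is most confident, rather than a random subset.
    """
    tokens = tokens.clone()
    while (tokens == MASK_ID).any():
        logits = model(tokens)                       # (seq_len, vocab)
        logits[:, MASK_ID] = -float("inf")           # never predict the mask token itself
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)               # per-position confidence and argmax token
        conf[tokens != MASK_ID] = -float("inf")      # only consider still-masked slots
        n_masked = int((tokens == MASK_ID).sum())
        idx = conf.topk(min(k, n_masked)).indices    # most confident masked positions
        tokens[idx] = pred[idx]                      # commit their argmax tokens
    return tokens
```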
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
Thrilled to share that our work received the Outstanding Paper Award at ICML! I will be giving the oral presentation on Tuesday at 4:15 PM. @Jaeyeon_Kim_0 and I both will be at the poster session shortly after the oral presentation. Please attend if possible!
A team from #KempnerInstitute, @hseas & @UTCompSci has won a best paper award at #ICML2025 for work unlocking the potential of masked diffusion models. Congrats to @Jaeyeon_Kim_0, @shahkulin98, Vasilis Kontonis, @ShamKakade6 and @sitanch. kempnerinstitute.harvard.edu/news/kempner-i… #AI
If you liked CASPR, you will like LASER Attention! Check it out
MuonClip... so many tricks to keep the maximum logits bounded during training. Gets me wondering why people don't try LASER (and maybe z-loss?)
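For reference, a minimal sketch of the z-loss idea (the auxiliary log(Z)^2 penalty popularized by PaLM), which discourages logits from growing without bound; the 1e-4 coefficient is the commonly cited setting and a tunable assumption here:

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_coef=1e-4):
    """Cross-entropy plus the z-loss regularizer.

    The z-loss penalizes log(Z)^2, where Z is the softmax normalizer,
    keeping the logits from drifting upward during training.
    logits: (batch, vocab), targets: (batch,) of class indices.
    """
    log_z = torch.logsumexp(logits, dim=-1)          # log of the softmax normalizer, (batch,)
    ce = F.cross_entropy(logits, targets)
    return ce + z_coef * (log_z ** 2).mean()
```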
Very interesting: standard attention causes vanishing gradients because most attention probabilities become very small after some training. LASER tackles this by pushing the attention operation into exponential space, i.e., exp(output) = softmax(QK^T) exp(V). They don't seem to exaggerate the performance…
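A minimal single-head sketch of the formula as quoted in the tweet, output = log(softmax(QK^T/√d) · exp(V)), computed via logsumexp so exp(V) is never materialized; causal masking and other details from the paper are omitted:

```python
import math
import torch

def laser_attention(q, k, v):
    """Single-head sketch of the LASER idea from the tweet:
    exp(output) = softmax(QK^T / sqrt(d)) @ exp(V), so
    output = log(softmax(QK^T / sqrt(d)) @ exp(V)).

    Computed stably as output[i, c] = logsumexp_j(log_p[i, j] + V[j, c]).
    Shapes: q, k are (n, d); v is (n, d_v).
    """
    logits = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])              # (n, n)
    log_p = torch.log_softmax(logits, dim=-1)                              # (n, n)
    return torch.logsumexp(log_p.unsqueeze(-1) + v.unsqueeze(-3), dim=-2)  # (n, d_v)
```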
Sensitivity and Sharpness of n-Simplicial Attention. On the topic of stabilizing training, I got unreasonably nerdsniped by 2-simplicial attention and ended up deriving the sensitivity and sharpness bounds of n-simplicial attention more generally...
I want you all to read @Kimi_Moonshot's technical report on K2, then go back to this thread. Awesome work by @Jianlin_S and team! x.com/Yuchenj_UW/sta…
As we're running out of high-quality training data, changing models' architecture is an essential solution. 2-simplicial Transformer - @AIatMeta's new type of Transformer with special attention mechanism that: ➡️ Compares triplets of tokens (not pairs) to capture richer…
This week's top AI/ML research papers:
- 2-Simplicial Attention
- UMA
- Transition Matching
- GLM-4.1V-Thinking
- The Trilemma of Truth in LLMs
- Do Vision-Language Models Have Internal World Models?
- The Automated LLM Speedrunning Benchmark
- RoboScape
- Test-Time Scaling with…
Meta researchers just dropped a new twist on Transformers—“2-simplicial attention”—and the early results are wild. Instead of classic dot-product pairs, the model uses trilinear functions (think attention over 3-way interactions) via an optimized Triton kernel. The payoff?…
An explanation of Match3 functions and the motivation for 2-simplicial attention. You have a shelf of about 30 films, which include: Lord of the Rings, Independence Day, the Harry Potter series, Idiocracy, Mission: Impossible, The Social Network, the Die Hard series, and so on. Now say you…
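For context, a brute-force reference for the Match3 task as it is commonly defined (a triple-wise analogue of Match2: does some triple of tokens sum to 0 modulo M?); the exact setup in the thread may differ, but the shelf analogy is the same idea of a constraint that three items jointly satisfy and no pair can certify on its own:

```python
from itertools import combinations

def match3(xs, modulus):
    """Brute-force reference for Match3: return True iff some triple
    of elements of `xs` sums to 0 modulo `modulus`.
    """
    return any((a + b + c) % modulus == 0 for a, b, c in combinations(xs, 3))

# example: 4 + 7 + 9 = 20, and 20 % 10 == 0
print(match3([3, 4, 7, 9], modulus=10))  # True
```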
You just know they wanted to call this "2 Fast 2-Simplicial" so bad.
Excited to share what I worked on during my time at Meta.
- We introduce a Triton-accelerated Transformer with *2-simplicial attention*, a tri-linear generalization of dot-product attention
- We show how to adapt RoPE to tri-linear forms
- We show 2-simplicial attention scales…
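A rough, heavily hedged sketch of the trilinear idea described above: the score for query i and key pair (j, k) is a trilinear form, with the softmax taken over all (j, k) pairs. The value combination (elementwise product of two value streams), the 1/√d scaling, and the omission of RoPE, masking, and windowing are assumptions for illustration; see the paper for the actual method and its efficient Triton kernel:

```python
import math
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Naive single-head sketch of trilinear (2-simplicial) attention.

    Score for query i and key pair (j, k) is sum_d q[i,d]*k1[j,d]*k2[k,d];
    the softmax runs over all (j, k) pairs. Value aggregation here is an
    assumption (elementwise product v1[j] * v2[k]).
    Shapes: all inputs are (n, d). Cost is O(n^3) as written.
    """
    d = q.shape[-1]
    scores = torch.einsum("id,jd,kd->ijk", q, k1, k2) / math.sqrt(d)   # (n, n, n)
    n = scores.shape[0]
    probs = scores.reshape(n, -1).softmax(dim=-1).reshape(n, n, n)     # softmax over (j, k)
    pair_values = v1.unsqueeze(1) * v2.unsqueeze(0)                    # (n, n, d)
    return torch.einsum("ijk,jkd->id", probs, pair_values)             # (n, d)
```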
Many, many, many such cases. Lots of alpha in scaling up simple 2018-2021 ideas that didn't win the academia attention game.
Wait, what? This paper has 4 citations and these guys decided to scale it to billion-parameter scale with an efficient Triton implementation? Incredible. Huge respect...