Alessio Devoto
@devoto_alessio
Researching Efficient ML/AI ☘️ | Intern @NVIDIA | PhD in Data Science with @s_scardapane | Visiting @EdinburghNLP | http://alessiodevoto.github.io |
A simple L₂ norm-based strategy can compress KV caches by up to 90% without sacrificing accuracy! 🚀 In arxiv.org/abs/2406.11430, we find that the attention score of a KV pair is highly correlated with the key embedding's L₂ norm! Super fun project w/ @yuzhaouoe @s_scardapane @pminervini
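For intuition, here is a minimal PyTorch sketch of the idea (not the paper's exact implementation; the shapes and the keep_ratio knob are illustrative): keep only the KV pairs whose key embeddings have the smallest L₂ norm, since those tend to receive the highest attention.

```python
import torch

def compress_kv_cache(keys, values, keep_ratio=0.1):
    """Keep the KV pairs whose key embeddings have the smallest L2 norm.

    keys, values: [batch, heads, seq_len, head_dim]
    keep_ratio: fraction of the cache to retain (illustrative knob).
    """
    seq_len = keys.shape[2]
    n_keep = max(1, int(seq_len * keep_ratio))
    # L2 norm of each key embedding along the head dimension
    key_norms = keys.norm(p=2, dim=-1)                       # [batch, heads, seq_len]
    # Low-norm keys tend to receive high attention, so keep the smallest norms
    keep_idx = key_norms.topk(n_keep, dim=-1, largest=False).indices
    keep_idx = keep_idx.sort(dim=-1).values                  # preserve temporal order
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)
```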

'Theorem Prover as a Judge for Synthetic Data Generation' has been accepted to ACL (Main) 🚀. Do check us out on July 30th (Wednesday), 11:00-12:30 at Hall 4/5! A huge thank you to my amazing collaborators: Shay @GiwonHong413849 @WendaLi8 📝: aclanthology.org/2025.acl-long.…
Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
🚨New paper alert!🚨 "Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them" @ActInterp ICML'25 @deepseek_ai popularised RLVR and distillation for 'reasoning training'! But how do they differ under the hood? Details in 🧵: (1/8)
*Into the land of automatic differentiation* Material is out! A short PhD course for the CS PhD in @SapienzaRoma covering basic and advanced topics in autodiff w/ slides, (rough) Notion notes, and two notebooks including a PyTorch-like implementation. 😅 sscardapane.it/teaching/phd-a…
Results on MMLU-Redux (arxiv.org/abs/2406.04127, NAACL'25), our manually curated and error-free subset of MMLU, are super strong as well!
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence…
oops did it again
bouncy ball guy has done it again
Twitter friends, here are some draft notes for my upcoming course on automatic differentiation, mostly based on the "Elements of Differentiable Programming" book. Let me know what you think! They also include a notebook on operator overloading. 🙃 notion.so/sscardapane/Au…
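To give a flavour of the operator-overloading approach the notebook covers, here is a tiny micrograd-style scalar reverse-mode autodiff sketch (my own minimal example, not the course code):

```python
class Value:
    """Scalar node in a dynamic computation graph (reverse-mode autodiff)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._grad_fn = None  # propagates self.grad to the parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def grad_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._grad_fn = grad_fn
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def grad_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._grad_fn = grad_fn
        return out

    def backward(self):
        # Topological order, then accumulate gradients from the output backwards
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            if v._grad_fn:
                v._grad_fn()

# y = x1 * x2 + x1  ->  dy/dx1 = x2 + 1, dy/dx2 = x1
x1, x2 = Value(3.0), Value(4.0)
y = x1 * x2 + x1
y.backward()
print(x1.grad, x2.grad)  # 5.0 3.0
```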
We made a Guide on mastering LoRA Hyperparameters, so you can learn to fine-tune LLMs correctly! Learn to: • Train smarter models with fewer hallucinations • Choose optimal: learning rates, epochs, LoRA rank, alpha • Avoid overfitting & underfitting 🔗docs.unsloth.ai/get-started/fi…
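As a rough illustration (not Unsloth's exact recipe; the model name and hyperparameter values below are placeholders), a LoRA setup with Hugging Face PEFT looks like:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; swap in whichever checkpoint you are fine-tuning
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor; the update is scaled by alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```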
Understanding and Coding KV Caching From Scratch -- The Extended Edition magazine.sebastianraschka.com/p/coding-the-k…
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs "we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the…
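For reference, a hedged sketch of NTK-aware RoPE base scaling in general (the standard technique, not necessarily LongLLaDA's exact variant):

```python
import torch

def ntk_scaled_rope_angles(head_dim, base=10000.0, scale=4.0, max_pos=8192):
    """NTK-aware RoPE: stretch the rotary base so low frequencies are
    interpolated while high frequencies stay mostly intact.

    scale: desired context-extension factor (e.g. 4x the training length).
    Standard NTK rule: base' = base * scale ** (head_dim / (head_dim - 2)).
    """
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    return torch.outer(positions, inv_freq)  # [max_pos, head_dim // 2] rotation angles
```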
Feels good to be back coding! Just picked a fun one from my “someday” side project list and finally added a KV cache to the LLMs From Scratch repo: github.com/rasbt/LLMs-fro…
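The core of a KV cache is just "compute K/V for the newest token, concatenate with the stored ones, reuse everything else". A minimal sketch (not the repo's actual code):

```python
import torch

class KVCache:
    """Minimal per-layer KV cache: append new keys/values at each decode step."""
    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: [batch, heads, new_tokens, head_dim]
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# During decoding, only the newest token's K/V are computed; past ones are reused
cache = KVCache()
k, v = cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
k, v = cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
print(k.shape)  # torch.Size([1, 8, 2, 64])
```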
Resa: Transparent Reasoning Models via SAEs "Specifically, SAE-Tuning involves two key stages: First, we use an SAE to probe the internal activations of a source model, identifying and extracting a dictionary of latent features that correspond to its reasoning processes. Second,…
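As background for the first stage, a sparse autoencoder over model activations is roughly the following (a generic SAE sketch, not Resa's SAE-Tuning code; sizes are illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Tiny SAE: decompose residual-stream activations into sparse latent features."""
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse feature activations (the "dictionary")
        x_hat = self.decoder(z)           # reconstruction of the original activation
        return x_hat, z

sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
acts = torch.randn(32, 768)               # activations captured from a source model
x_hat, z = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
```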
Transformers don’t need *trained* registers!
Artifacts in your attention maps? Forgot to train with registers? Use 𝙩𝙚𝙨𝙩-𝙩𝙞𝙢𝙚 𝙧𝙚𝙜𝙞𝙨𝙩𝙚𝙧𝙨! We find a sparse set of activations that set the artifact positions. We can shift them anywhere ("Shifted"), even outside the image into an untrained token. Clean maps, no retraining.