Simone Scardapane
@s_scardapane
I fall in love with a new #machinelearning topic every month 🙄 | Researcher @SapienzaRoma | Author: Alice in a diff wonderland http://sscardapane.it/alice-book
*Alice needs her friends!* If you bought a copy of "Alice in a differentiable wonderland" or are planning to use the book for a course - I'd love to hear your feedback! Working on releasing worked-out code + figure sources for instructors soon. 🙃 sscardapane.it/alice-book/

*ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features* by @alec_helbling @tunahansalih @Ben_Hoov @PINguAR @PoloChau Creates saliency maps for diffusion ViTs by propagating concepts (e.g., car) and repurposing cross-attention layers. arxiv.org/abs/2502.04320
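A much-simplified sketch of the general idea (my own reconstruction, not the authors' implementation): score image patch tokens against a concept embedding through a cross-attention layer's query/key projections and reshape the result into a spatial saliency map. The tensor names and the 16x16 patch grid are assumptions.

```python
import torch

def concept_saliency(patch_tokens, concept_emb, w_q, w_k, grid=(16, 16)):
    """Sketch: score each image patch against a concept embedding using the
    (frozen) query/key projections of a cross-attention layer."""
    q = concept_emb @ w_q                      # (d,) -> (d_head,)
    k = patch_tokens @ w_k                     # (n_patches, d) -> (n_patches, d_head)
    scores = k @ q / k.shape[-1] ** 0.5        # concept-to-patch similarities
    return scores.softmax(dim=-1).reshape(grid)  # spatial saliency map

# Hypothetical usage with random tensors standing in for real DiT activations.
patches = torch.randn(256, 768)                # 16x16 patches from one block
concept = torch.randn(768)                     # embedding of the concept "car"
wq, wk = torch.randn(768, 64), torch.randn(768, 64)
heatmap = concept_saliency(patches, concept, wq, wk)
```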

*Rethinking Early Stopping: Refine, Then Calibrate* by @Eugene_Berta @LChoshen @DHolzmueller @BachFrancis Doing early stopping on the "refinement loss" (the original loss minus its calibration component) is beneficial for both accuracy and calibration. arxiv.org/abs/2501.19195
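A hedged sketch of what this can look like in practice, assuming (as a simplification, not the paper's exact estimator) that the calibration part is removed via post-hoc temperature scaling on the validation set:

```python
import torch
import torch.nn.functional as F

def refinement_loss(val_logits, val_labels, temps=torch.linspace(0.25, 4.0, 64)):
    """Sketch: validation NLL after the best post-hoc temperature rescaling,
    i.e. the part of the loss that calibration alone cannot fix."""
    losses = [F.cross_entropy(val_logits / t, val_labels) for t in temps]
    return torch.stack(losses).min()

# Early stopping then monitors the recalibrated loss instead of the raw one.
best, patience = float("inf"), 0
for epoch in range(100):
    # ... train for one epoch; the random tensors below stand in for real
    # validation logits and labels ...
    val_logits, val_labels = torch.randn(512, 10), torch.randint(0, 10, (512,))
    score = refinement_loss(val_logits, val_labels).item()
    if score < best:
        best, patience = score, 0
    else:
        patience += 1
    if patience >= 5:
        break
```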

*I-Con: A Unifying Framework for Representation Learning* by @Sa_9810 @mhamilton723 et al. They show that many losses (contrastive, supervised, clustering, ...) can be derived from a single loss defined in terms of neighborhood distributions. arxiv.org/abs/2504.16929
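A rough sketch of the unifying objective as I read it: an average KL divergence between a supervisory neighbor distribution p(·|i) and a learned one q(·|i) built from embedding similarities. The uniform-over-same-class choice of p below is just one instantiation (roughly a supervised contrastive setup), not the general recipe.

```python
import torch
import torch.nn.functional as F

def icon_loss(z, labels, tau=0.1):
    """Sketch: mean_i KL(p(.|i) || q(.|i)), with p uniform over same-class
    neighbors and q a softmax over scaled cosine similarities."""
    z = F.normalize(z, dim=-1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.T / tau).masked_fill(eye, -1e9)        # exclude self-pairs
    log_q = sim.log_softmax(dim=-1)                     # learned neighbor distribution
    same = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    p = same / same.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return F.kl_div(log_q, p, reduction="batchmean")

z = torch.randn(32, 128, requires_grad=True)
loss = icon_loss(z, torch.randint(0, 4, (32,)))
loss.backward()
```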

*Emergence and Evolution of Interpretable Concepts in Diffusion Models* by @berk_tinaz @zalan_fabian @mahdisoltanol SAEs trained on cross-attention layers of Stable Diffusion are (surprisingly) interpretable and can be used to intervene on the generation process. arxiv.org/abs/2504.15473
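A toy sketch of the intervention side (the linear layers below are stand-ins for a trained SAE, not the paper's models): amplify one concept latent and decode back before the denoising step continues.

```python
import torch
import torch.nn as nn

def steer_with_sae(acts, sae_enc, sae_dec, concept_idx, scale=5.0):
    """Sketch: encode cross-attention activations with an SAE, boost one
    concept latent, and decode back (an intervention on generation)."""
    z = sae_enc(acts).relu().clone()
    z[..., concept_idx] = z[..., concept_idx] * scale   # amplify one concept
    return sae_dec(z)

d, latent = 768, 4096
enc, dec = nn.Linear(d, latent), nn.Linear(latent, d)   # hypothetical trained SAE
steered = steer_with_sae(torch.randn(77, d), enc, dec, concept_idx=123)
```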

🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 huggingface.co/spaces/nvidia/…

*Dense Backpropagation Improves Training for Sparse MoEs* by @PandaAshwinee @tomgoldsteincs et al. They modify the top-k router of a MoE by adding a "default" activation for unselected experts in order to have a dense gradient during the backward pass. arxiv.org/abs/2504.12463
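A hedged sketch of the mechanism (my own simplification, not the authors' implementation): keep the full softmax over experts, run only the selected ones, and substitute a cheap running "default" output for the rest, so every gate receives gradient.

```python
import torch
import torch.nn as nn

class DenseGradMoE(nn.Module):
    """Sketch: top-k MoE where unselected experts contribute an EMA
    'default' vector, giving the router a dense gradient."""
    def __init__(self, dim, n_experts=8, k=2, momentum=0.9):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)
        self.k, self.momentum = k, momentum
        self.register_buffer("default", torch.zeros(n_experts, dim))

    def forward(self, x):                                # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)           # dense gates (tokens, E)
        topk = gates.topk(self.k, dim=-1).indices
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (topk == e).any(dim=-1)                # tokens routed to expert e
            y = self.default[e].expand_as(x).clone()     # cheap stand-in output
            if sel.any():
                y_sel = expert(x[sel])                   # real compute only where selected
                y[sel] = y_sel
                with torch.no_grad():                    # EMA update of the default
                    self.default[e].lerp_(y_sel.mean(0), 1 - self.momentum)
            out = out + gates[:, e:e + 1] * y            # every gate touches the output
        return out

moe = DenseGradMoE(dim=64)
moe(torch.randn(16, 64)).pow(2).mean().backward()        # router gradient is dense
```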

*Into the land of automatic differentiation* Material is out! A short PhD course for the CS PhD in @SapienzaRoma covering basic and advanced topics in autodiff w/ slides, (rough) Notion notes, and two notebooks including a PyTorch-like implementation. 😅 sscardapane.it/teaching/phd-a…

*Perception Encoder* by @cfeichtenhofer et al. When fine-tuning a CLIP-style model with LLM or dense-prediction decoders, the best embeddings may not sit at the final layer but somewhere in the middle. They release code, datasets, and SOTA-level models. arxiv.org/abs/2504.13181

*Antidistillation Sampling* by @yashsavani_ @ashertrockman @zicokolter et al. They modify the logits of a model with a penalty term that poisons potential distillation attempts (by estimating the downstream distillation loss). arxiv.org/abs/2504.13146
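Here is only the sampling-side plumbing as a hedged sketch; the hypothetical `estimate_distillation_gain` stub stands in for the paper's actual proxy of the student's distillation loss.

```python
import torch

def antidistillation_sample(logits, estimate_distillation_gain, lam=1.0):
    """Sketch: subtract a per-token penalty from the teacher's next-token
    logits before sampling, so tokens that would help a student distill
    become less likely."""
    penalty = estimate_distillation_gain(logits)     # (vocab,), hypothetical stub
    probs = (logits - lam * penalty).softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Dummy estimator just to show the call; the real one needs a proxy student.
logits = torch.randn(32_000)
token = antidistillation_sample(logits, lambda l: torch.zeros_like(l))
```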

*Forgetting Transformer: Softmax Attention with a Forget Gate* by @zhxlin @nikishin_evg @AaronCourville They add a forgetting mechanism to attention by computing a "forget factor" for each token and biasing the attention computation. arxiv.org/abs/2503.02130
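A single-head sketch of the mechanism, assuming the bias takes the cumulative log-forget form (my reading of the idea, simplified):

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, forget_logits):
    """Sketch: per-token forget gates bias causal attention logits by the
    accumulated log-forget between each key and the query."""
    T, d = q.shape
    log_f = F.logsigmoid(forget_logits)            # (T,) log forget factors
    c = log_f.cumsum(dim=0)
    bias = c[:, None] - c[None, :]                 # bias[i, j] = sum_{l=j+1..i} log f_l
    scores = q @ k.T / d ** 0.5 + bias
    causal = torch.tril(torch.ones(T, T)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    return scores.softmax(dim=-1) @ v

T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
out = forgetting_attention(q, k, v, forget_logits=torch.randn(T))
```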

Thanks for the citation and the nice words @paraschopra!
This is a fantastic, visually stunning, free introductory book on deep learning. Highly recommended for curious people who want the lay of the land.
Now on Hacker News: news.ycombinator.com/item?id=444261… arxiv.org/abs/2404.17625
*OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens* by @liujc1998 et al. A pipeline to quickly trace parts of the LLM output (verbatim) back to the training documents, in almost real-time, tested on the OLMo family of models. arxiv.org/abs/2504.07096
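The real system relies on infini-gram-style suffix indexing over trillions of tokens; the toy sketch below only shows the matching logic, with a plain n-gram dictionary as a stand-in for the index.

```python
from collections import defaultdict

def build_ngram_index(corpus_docs, n=3):
    """Toy stand-in for a suffix index: map every n-gram to the documents
    containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(corpus_docs):
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def trace_output(output_tokens, index, n=3):
    """Return (start, end, doc_ids) for every verbatim n-gram of the output
    that also appears in the corpus."""
    spans = []
    for i in range(len(output_tokens) - n + 1):
        gram = tuple(output_tokens[i:i + n])
        if gram in index:
            spans.append((i, i + n, sorted(index[gram])))
    return spans

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["dogs", "run", "fast", "in", "the", "park"]]
idx = build_ngram_index(corpus)
print(trace_output(["the", "cat", "sat", "on", "a", "rug"], idx))
```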

*NoProp: Training Neural Networks without Backpropagation or Forward-propagation* by @yeewhye et al. They use a neural network to define a denoising process over the class labels, which allows them to train the blocks independently (i.e., "no backprop"). arxiv.org/abs/2503.24322
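A hedged sketch under heavy simplifications (one-hot label "signal", MLP blocks, a linear noise schedule of my choosing): each block denoises a noisy label given the input and is trained with its own local loss, so no gradient ever flows between blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, in_dim, n_blocks = 10, 32, 4
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(in_dim + n_classes, 64), nn.ReLU(),
                  nn.Linear(64, n_classes))
    for _ in range(n_blocks))
alphas = torch.linspace(0.9, 0.1, n_blocks)         # one noise level per block
opt = torch.optim.Adam(blocks.parameters(), lr=1e-3)

def train_step(x, y):
    """Each block denoises a noisy one-hot label given x with a local
    cross-entropy; blocks never exchange gradients."""
    opt.zero_grad()
    u = F.one_hot(y, n_classes).float()             # clean label signal
    loss = 0.0
    for a, block in zip(alphas, blocks):
        z = a.sqrt() * u + (1 - a).sqrt() * torch.randn_like(u)
        loss = loss + F.cross_entropy(block(torch.cat([x, z], dim=-1)), y)
    loss.backward()
    opt.step()

train_step(torch.randn(16, in_dim), torch.randint(0, n_classes, (16,)))
```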

Twitter friends, here's some draft notes for my upcoming course on automatic differentiation, mostly based on the "Elements of differentiable programming" book. Let me know what you think! They also include a notebook on operator overloading. 🙃 notion.so/sscardapane/Au…

*Generalized Interpolating Discrete Diffusion* by @dvruette @orvieto_antonio et al. A class of discrete diffusion models combining standard masking with uniform noise to allow the model to potentially "correct" previously wrong tokens. arxiv.org/abs/2503.04482
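A sketch of the forward (corruption) process only, with a fixed mask/uniform mixing ratio that is my simplification rather than the paper's schedule:

```python
import torch

def gidd_corrupt(tokens, t, vocab_size, mask_id, p_uniform=0.2):
    """Sketch: at corruption level t, each corrupted position becomes [MASK]
    with prob (1 - p_uniform) or a uniformly random token with prob
    p_uniform, which the reverse model can later 'correct'."""
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < p_uniform
    random_tok = torch.randint_like(tokens, vocab_size)
    noised = torch.where(use_uniform, random_tok, torch.full_like(tokens, mask_id))
    return torch.where(corrupt, noised, tokens)

x = torch.randint(0, 100, (2, 16))
print(gidd_corrupt(x, t=0.5, vocab_size=100, mask_id=100))
```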

*Differentiable Logic Cellular Automata* by @PietroMiotti @eyvindn @RandazzoEttore @zzznah Combines differentiable cellular automata with differentiable logic gates to learn recurrent circuits exhibiting complex behavior. google-research.github.io/self-organisin…

*From superposition to sparse codes: interpretable representations in NNs* by @klindt_david @ninamiolane @rpatrik96 @charles0neill Nice overview of the linearity of NN representations and the use of sparse coding to recover interpretable activations. arxiv.org/abs/2503.01824

*Recursive Inference Scaling* by @ibomohsin @XiaohuaZhai Recursively applying the first part of a model can be a strong baseline in many scenarios when evaluating at a fixed compute budget. arxiv.org/abs/2502.07503
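A minimal sketch of the recipe, assuming a block-list view of the model (the split point and repeat count are made-up hyperparameters):

```python
import torch
import torch.nn as nn

class RecursivePrefix(nn.Module):
    """Sketch: loop the first n_recur blocks several times before running
    the rest, reusing parameters instead of adding depth."""
    def __init__(self, blocks, n_recur=4, repeats=3):
        super().__init__()
        self.prefix = nn.ModuleList(blocks[:n_recur])
        self.rest = nn.ModuleList(blocks[n_recur:])
        self.repeats = repeats

    def forward(self, h):
        for _ in range(self.repeats):              # recursive application of the prefix
            for block in self.prefix:
                h = block(h)
        for block in self.rest:
            h = block(h)
        return h

blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)]
out = RecursivePrefix(blocks)(torch.randn(2, 64))
```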

*Universal Sparse Autoencoders* by @HThasarathan @Napoolar @MatthewKowal9 @CSProfKGD They train a shared SAE latent space on several vision encoders at once, showing, e.g., how the same concept activates in different models. arxiv.org/abs/2502.03714
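A rough sketch of the cross-model setup as I understand it (widths, TopK sparsity, and the reconstruction scheme are assumptions): one shared sparse latent with per-model encoder/decoder heads, trained to reconstruct every model's activations from any single model's encoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalSAE(nn.Module):
    """Sketch: a shared sparse latent space with per-model heads."""
    def __init__(self, dims, latent=4096, k=32):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, latent) for d in dims)
        self.decoders = nn.ModuleList(nn.Linear(latent, d) for d in dims)
        self.k = k

    def encode(self, a, model_idx):
        z = self.encoders[model_idx](a).relu()
        topk = z.topk(self.k, dim=-1)                    # TopK sparsity
        return torch.zeros_like(z).scatter(-1, topk.indices, topk.values)

    def forward(self, acts, src_idx):
        """Encode one model's activations, reconstruct every model's."""
        z = self.encode(acts[src_idx], src_idx)
        return z, sum(F.mse_loss(dec(z), a) for dec, a in zip(self.decoders, acts))

dims = [768, 1024]                                       # e.g., two vision encoders
sae = UniversalSAE(dims)
acts = [torch.randn(32, d) for d in dims]                # paired activations (same images)
z, loss = sae(acts, src_idx=0)
loss.backward()
```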
