Simone Scardapane
@s_scardapane
I fall in love with a new #machinelearning topic every month 🙄 | Researcher @SapienzaRoma | Author: Alice in a diff wonderland http://sscardapane.it/alice-book
*Alice needs her friends!* If you bought a copy of "Alice in a differentiable wonderland" or are planning to use the book for a course - I'd love to hear your feedback! Working on releasing worked-out code + figure sources for instructors soon. 🙃 sscardapane.it/alice-book/

*ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features* by @alec_helbling @tunahansalih @Ben_Hoov @PINguAR @PoloChau Creates saliency maps for diffusion ViTs by propagating concepts (e.g., car) and repurposing cross-attention layers. arxiv.org/abs/2502.04320
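A much-simplified sketch of the general idea (my own reconstruction, not the authors' implementation): score image patch tokens against a concept embedding through a cross-attention layer's query/key projections and reshape the result into a spatial saliency map. The tensor names and the 16x16 patch grid are assumptions.

```python
import torch

def concept_saliency(patch_tokens, concept_emb, w_q, w_k, grid=(16, 16)):
    """Sketch: score each image patch against a concept embedding using the
    (frozen) query/key projections of a cross-attention layer."""
    q = concept_emb @ w_q                      # (d,) -> (d_head,)
    k = patch_tokens @ w_k                     # (n_patches, d) -> (n_patches, d_head)
    scores = k @ q / k.shape[-1] ** 0.5        # concept-to-patch similarities
    return scores.softmax(dim=-1).reshape(grid)  # spatial saliency map

# Hypothetical usage with random tensors standing in for real DiT activations.
patches = torch.randn(256, 768)                # 16x16 patches from one block
concept = torch.randn(768)                     # embedding of the concept "car"
wq, wk = torch.randn(768, 64), torch.randn(768, 64)
heatmap = concept_saliency(patches, concept, wq, wk)
```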

*Rethinking Early Stopping: Refine, Then Calibrate* by @Eugene_Berta @LChoshen @DHolzmueller @BachFrancis Doing early stopping on the "refinement loss" (the original loss minus its calibration component) is beneficial for both accuracy and calibration. arxiv.org/abs/2501.19195
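A hedged sketch of what this can look like in practice, assuming (as a simplification, not the paper's exact estimator) that the calibration part is removed via post-hoc temperature scaling on the validation set:

```python
import torch
import torch.nn.functional as F

def refinement_loss(val_logits, val_labels, temps=torch.linspace(0.25, 4.0, 64)):
    """Sketch: validation NLL after the best post-hoc temperature rescaling,
    i.e. the part of the loss that calibration alone cannot fix."""
    losses = [F.cross_entropy(val_logits / t, val_labels) for t in temps]
    return torch.stack(losses).min()

# Early stopping then monitors the recalibrated loss instead of the raw one.
best, patience = float("inf"), 0
for epoch in range(100):
    # ... train for one epoch; the random tensors below stand in for real
    # validation logits and labels ...
    val_logits, val_labels = torch.randn(512, 10), torch.randint(0, 10, (512,))
    score = refinement_loss(val_logits, val_labels).item()
    if score < best:
        best, patience = score, 0
    else:
        patience += 1
    if patience >= 5:
        break
```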

*I-Con: A Unifying Framework for Representation Learning* by @Sa_9810 @mhamilton723 et al. They show that many losses (contrastive, supervised, clustering, ...) can be derived from a single loss defined in terms of neighborhood distributions. arxiv.org/abs/2504.16929
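A rough sketch of the unifying objective as I read it: an average KL divergence between a supervisory neighbor distribution p(·|i) and a learned one q(·|i) built from embedding similarities. The uniform-over-same-class choice of p below is just one instantiation (roughly a supervised contrastive setup), not the general recipe.

```python
import torch
import torch.nn.functional as F

def icon_loss(z, labels, tau=0.1):
    """Sketch: mean_i KL(p(.|i) || q(.|i)), with p uniform over same-class
    neighbors and q a softmax over scaled cosine similarities."""
    z = F.normalize(z, dim=-1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.T / tau).masked_fill(eye, -1e9)        # exclude self-pairs
    log_q = sim.log_softmax(dim=-1)                     # learned neighbor distribution
    same = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    p = same / same.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return F.kl_div(log_q, p, reduction="batchmean")

z = torch.randn(32, 128, requires_grad=True)
loss = icon_loss(z, torch.randint(0, 4, (32,)))
loss.backward()
```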

*Emergence and Evolution of Interpretable Concepts in Diffusion Models* by @berk_tinaz @zalan_fabian @mahdisoltanol SAEs trained on cross-attention layers of Stable Diffusion are (surprisingly) interpretable and can be used to intervene on the generation process. arxiv.org/abs/2504.15473
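A toy sketch of the intervention side (the linear layers below are stand-ins for a trained SAE, not the paper's models): amplify one concept latent and decode back before the denoising step continues.

```python
import torch
import torch.nn as nn

def steer_with_sae(acts, sae_enc, sae_dec, concept_idx, scale=5.0):
    """Sketch: encode cross-attention activations with an SAE, boost one
    concept latent, and decode back (an intervention on generation)."""
    z = sae_enc(acts).relu().clone()
    z[..., concept_idx] = z[..., concept_idx] * scale   # amplify one concept
    return sae_dec(z)

d, latent = 768, 4096
enc, dec = nn.Linear(d, latent), nn.Linear(latent, d)   # hypothetical trained SAE
steered = steer_with_sae(torch.randn(77, d), enc, dec, concept_idx=123)
```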

🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 huggingface.co/spaces/nvidia/…

*Dense Backpropagation Improves Training for Sparse MoEs* by @PandaAshwinee @tomgoldsteincs et al. They modify the top-k router of a MoE by adding a "default" activation for unselected experts in order to have a dense gradient during the backward pass. arxiv.org/abs/2504.12463
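A hedged sketch of the mechanism (my own simplification, not the authors' implementation): keep the full softmax over experts, run only the selected ones, and substitute a cheap running "default" output for the rest, so every gate receives gradient.

```python
import torch
import torch.nn as nn

class DenseGradMoE(nn.Module):
    """Sketch: top-k MoE where unselected experts contribute an EMA
    'default' vector, giving the router a dense gradient."""
    def __init__(self, dim, n_experts=8, k=2, momentum=0.9):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)
        self.k, self.momentum = k, momentum
        self.register_buffer("default", torch.zeros(n_experts, dim))

    def forward(self, x):                                # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)           # dense gates (tokens, E)
        topk = gates.topk(self.k, dim=-1).indices
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (topk == e).any(dim=-1)                # tokens routed to expert e
            y = self.default[e].expand_as(x).clone()     # cheap stand-in output
            if sel.any():
                y_sel = expert(x[sel])                   # real compute only where selected
                y[sel] = y_sel
                with torch.no_grad():                    # EMA update of the default
                    self.default[e].lerp_(y_sel.mean(0), 1 - self.momentum)
            out = out + gates[:, e:e + 1] * y            # every gate touches the output
        return out

moe = DenseGradMoE(dim=64)
moe(torch.randn(16, 64)).pow(2).mean().backward()        # router gradient is dense
```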

*Into the land of automatic differentiation* Material is out! A short PhD course for the CS PhD in @SapienzaRoma covering basic and advanced topics in autodiff w/ slides, (rough) Notion notes, and two notebooks including a PyTorch-like implementation. 😅 sscardapane.it/teaching/phd-a…

*Perception Encoder* by @cfeichtenhofer et al. When fine-tuning a CLIP-style model with LLM or dense-prediction decoders, the best embeddings may not sit at the final layer but somewhere in the middle. They release code, datasets, and SOTA-level models. arxiv.org/abs/2504.13181

*Antidistillation Sampling* by @yashsavani_ @ashertrockman @zicokolter et al. They modify the logits of a model with a penalty term that poisons potential distillation attempts (by estimating the downstream distillation loss). arxiv.org/abs/2504.13146
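Here is only the sampling-side plumbing as a hedged sketch; the hypothetical `estimate_distillation_gain` stub stands in for the paper's actual proxy of the student's distillation loss.

```python
import torch

def antidistillation_sample(logits, estimate_distillation_gain, lam=1.0):
    """Sketch: subtract a per-token penalty from the teacher's next-token
    logits before sampling, so tokens that would help a student distill
    become less likely."""
    penalty = estimate_distillation_gain(logits)     # (vocab,), hypothetical stub
    probs = (logits - lam * penalty).softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Dummy estimator just to show the call; the real one needs a proxy student.
logits = torch.randn(32_000)
token = antidistillation_sample(logits, lambda l: torch.zeros_like(l))
```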

*Forgetting Transformer: Softmax Attention with a Forget Gate* by @zhxlin @nikishin_evg @AaronCourville They add a forgetting mechanism to attention by computing a "forget factor" for each token and biasing the attention computation. arxiv.org/abs/2503.02130
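A single-head sketch of the mechanism, assuming the bias takes the cumulative log-forget form (my reading of the idea, simplified):

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, forget_logits):
    """Sketch: per-token forget gates bias causal attention logits by the
    accumulated log-forget between each key and the query."""
    T, d = q.shape
    log_f = F.logsigmoid(forget_logits)            # (T,) log forget factors
    c = log_f.cumsum(dim=0)
    bias = c[:, None] - c[None, :]                 # bias[i, j] = sum_{l=j+1..i} log f_l
    scores = q @ k.T / d ** 0.5 + bias
    causal = torch.tril(torch.ones(T, T)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    return scores.softmax(dim=-1) @ v

T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
out = forgetting_attention(q, k, v, forget_logits=torch.randn(T))
```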

Thanks for the citation and the nice words @paraschopra!
This is a fantastic, visually stunning, free introductory book on deep learning. Highly recommended for curious people who want the lay of the land.
Now on Hacker News: news.ycombinator.com/item?id=444261… arxiv.org/abs/2404.17625
*OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens* by @liujc1998 et al. A pipeline to quickly trace parts of the LLM output (verbatim) back to the training documents, in almost real-time, tested on the OLMo family of models. arxiv.org/abs/2504.07096
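The real system relies on infini-gram-style suffix indexing over trillions of tokens; the toy sketch below only shows the matching logic, with a plain n-gram dictionary as a stand-in for the index.

```python
from collections import defaultdict

def build_ngram_index(corpus_docs, n=3):
    """Toy stand-in for a suffix index: map every n-gram to the documents
    containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(corpus_docs):
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def trace_output(output_tokens, index, n=3):
    """Return (start, end, doc_ids) for every verbatim n-gram of the output
    that also appears in the corpus."""
    spans = []
    for i in range(len(output_tokens) - n + 1):
        gram = tuple(output_tokens[i:i + n])
        if gram in index:
            spans.append((i, i + n, sorted(index[gram])))
    return spans

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["dogs", "run", "fast", "in", "the", "park"]]
idx = build_ngram_index(corpus)
print(trace_output(["the", "cat", "sat", "on", "a", "rug"], idx))
```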

*NoProp: Training Neural Networks without Backpropagation or Forward-propagation* by @yeewhye et al. They use a neural network to define a denoising process over the class labels, which allows them to train the blocks independently (i.e., "no backprop"). arxiv.org/abs/2503.24322
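A hedged sketch under heavy simplifications (one-hot label "signal", MLP blocks, a linear noise schedule of my choosing): each block denoises a noisy label given the input and is trained with its own local loss, so no gradient ever flows between blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, in_dim, n_blocks = 10, 32, 4
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(in_dim + n_classes, 64), nn.ReLU(),
                  nn.Linear(64, n_classes))
    for _ in range(n_blocks))
alphas = torch.linspace(0.9, 0.1, n_blocks)         # one noise level per block
opt = torch.optim.Adam(blocks.parameters(), lr=1e-3)

def train_step(x, y):
    """Each block denoises a noisy one-hot label given x with a local
    cross-entropy; blocks never exchange gradients."""
    opt.zero_grad()
    u = F.one_hot(y, n_classes).float()             # clean label signal
    loss = 0.0
    for a, block in zip(alphas, blocks):
        z = a.sqrt() * u + (1 - a).sqrt() * torch.randn_like(u)
        loss = loss + F.cross_entropy(block(torch.cat([x, z], dim=-1)), y)
    loss.backward()
    opt.step()

train_step(torch.randn(16, in_dim), torch.randint(0, n_classes, (16,)))
```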

Twitter friends, here's some draft notes for my upcoming course on automatic differentiation, mostly based on the "Elements of differentiable programming" book. Let me know what you think! They also include a notebook on operator overloading. 🙃 notion.so/sscardapane/Au…

*Generalized Interpolating Discrete Diffusion* by @dvruette @orvieto_antonio et al. A class of discrete diffusion models combining standard masking with uniform noise to allow the model to potentially "correct" previously wrong tokens. arxiv.org/abs/2503.04482
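A sketch of the forward (corruption) process only, with a fixed mask/uniform mixing ratio that is my simplification rather than the paper's schedule:

```python
import torch

def gidd_corrupt(tokens, t, vocab_size, mask_id, p_uniform=0.2):
    """Sketch: at corruption level t, each corrupted position becomes [MASK]
    with prob (1 - p_uniform) or a uniformly random token with prob
    p_uniform, which the reverse model can later 'correct'."""
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < p_uniform
    random_tok = torch.randint_like(tokens, vocab_size)
    noised = torch.where(use_uniform, random_tok, torch.full_like(tokens, mask_id))
    return torch.where(corrupt, noised, tokens)

x = torch.randint(0, 100, (2, 16))
print(gidd_corrupt(x, t=0.5, vocab_size=100, mask_id=100))
```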

*Differentiable Logic Cellular Automata* by @PietroMiotti @eyvindn @RandazzoEttore @zzznah Combines differentiable cellular automata with differentiable logic gates to learn recurrent circuits exhibiting complex behavior. google-research.github.io/self-organisin…

*From superposition to sparse codes: interpretable representations in NNs* by @klindt_david @ninamiolane @rpatrik96 @charles0neill Nice overview of the linearity of NN representations and the use of sparse coding to recover interpretable activations. arxiv.org/abs/2503.01824

*Recursive Inference Scaling* by @ibomohsin @XiaohuaZhai Recursively applying the first part of a model can be a strong baseline in many scenarios when evaluating at a fixed compute budget. arxiv.org/abs/2502.07503
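A minimal sketch of the recipe, assuming a block-list view of the model (the split point and repeat count are made-up hyperparameters):

```python
import torch
import torch.nn as nn

class RecursivePrefix(nn.Module):
    """Sketch: loop the first n_recur blocks several times before running
    the rest, reusing parameters instead of adding depth."""
    def __init__(self, blocks, n_recur=4, repeats=3):
        super().__init__()
        self.prefix = nn.ModuleList(blocks[:n_recur])
        self.rest = nn.ModuleList(blocks[n_recur:])
        self.repeats = repeats

    def forward(self, h):
        for _ in range(self.repeats):              # recursive application of the prefix
            for block in self.prefix:
                h = block(h)
        for block in self.rest:
            h = block(h)
        return h

blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)]
out = RecursivePrefix(blocks)(torch.randn(2, 64))
```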

*Universal Sparse Autoencoders* by @HThasarathan @Napoolar @MatthewKowal9 @CSProfKGD They train a shared SAE latent space on several vision encoders at once, showing, e.g., how the same concept activates in different models. arxiv.org/abs/2502.03714
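A rough sketch of the cross-model setup as I understand it (widths, TopK sparsity, and the reconstruction scheme are assumptions): one shared sparse latent with per-model encoder/decoder heads, trained to reconstruct every model's activations from any single model's encoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalSAE(nn.Module):
    """Sketch: a shared sparse latent space with per-model heads."""
    def __init__(self, dims, latent=4096, k=32):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, latent) for d in dims)
        self.decoders = nn.ModuleList(nn.Linear(latent, d) for d in dims)
        self.k = k

    def encode(self, a, model_idx):
        z = self.encoders[model_idx](a).relu()
        topk = z.topk(self.k, dim=-1)                    # TopK sparsity
        return torch.zeros_like(z).scatter(-1, topk.indices, topk.values)

    def forward(self, acts, src_idx):
        """Encode one model's activations, reconstruct every model's."""
        z = self.encode(acts[src_idx], src_idx)
        return z, sum(F.mse_loss(dec(z), a) for dec, a in zip(self.decoders, acts))

dims = [768, 1024]                                       # e.g., two vision encoders
sae = UniversalSAE(dims)
acts = [torch.randn(32, d) for d in dims]                # paired activations (same images)
z, loss = sae(acts, src_idx=0)
loss.backward()
```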
