Lorenzo Noci
@lorenzo_noci
PhD in Machine Learning at @ETH working on deep learning theory and principled large-scale AI models.
Pretraining large-depth transformers just got easier! 🚀 HP transfer across model scale ⚡ Compute-efficient pretraining. Super cool collab with @DeyNolan @BCZhang_ @mufan_li @CPehlevan @ShaneBergsma @BorisHanin Joel Hestness @CerebrasSystems
(1/7) @CerebrasSystems Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇
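For intuition, here is a minimal PyTorch sketch of a depth-scaled residual block, assuming a 1/L multiplier on each residual branch as one parametrization in this family; the exact CompleteP rules (including how learning rates scale with depth) are in the paper.

# Minimal sketch (not the paper's code): a pre-LN residual block whose branch
# output is scaled by 1/L, so activations stay O(1) as depth grows.
# The exact CompleteP parametrization (incl. learning-rate rules) is in the paper.
import torch
import torch.nn as nn

class DepthScaledBlock(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.branch_scale = 1.0 / n_layers  # assumed depth scaling; see paper

    def forward(self, x):
        return x + self.branch_scale * self.mlp(self.norm(x))

# Usage: stack n_layers of these blocks; hyperparameters tuned at small depth
# should transfer more reliably when the branch scale tracks 1/L.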
Pass by if you want to know about scaling up your model under distribution shifts of the training data. Takeaway: the feature-learning strength in muP needs to be tuned to optimize the forgetting/plasticity trade-off.
🚨 Excited to present our new paper at 🇨🇦 #ICML2025! 🚨 "The Importance of Being Lazy: Scaling Limits of Continual Learning" Great collab with @alebreccia99, @glanzillo11, Thomas Hofmann, @lorenzo_noci. 🧵 1/6
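For context on the "lazy" in the title, here is a generic lazy-training rescaling sketch (in the style of Chizat & Bach), not the paper's exact setup: a scale alpha that interpolates between rich feature learning and the lazy/kernel regime.

# Generic lazy-training rescaling (not the paper's exact setup): scaling the
# output by alpha and centering at initialization interpolates between
# "rich" feature learning (small alpha) and the lazy/kernel regime (large alpha).
import torch
import torch.nn as nn

def lazy_model(net: nn.Module, net_init: nn.Module, alpha: float):
    """Return f_alpha(x) = alpha * (net(x) - net_init(x))."""
    def f(x):
        with torch.no_grad():
            f0 = net_init(x)  # frozen copy of the network at initialization
        return alpha * (net(x) - f0)
    return f

# With the objective also rescaled (e.g. by 1/alpha**2 for square loss), large
# alpha keeps the weights close to init (lazy regime), which is the kind of
# plasticity-vs-forgetting knob studied for continual learning.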
Our research group in the department of Mathematics and CS at the University of Basel (Switzerland) is looking for several PhD candidates and one post-doc who have a theoretical background in optimization and machine learning or practical experience in reasoning. RT please.
Come hear about how transformers perform factual recall using associative memories, and how this emerges in phases during training! #ICLR2025 poster #602 at 3pm today. Led by @EshaanNichani Link: iclr.cc/virtual/2025/p… Paper: arxiv.org/abs/2412.06538
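As a toy illustration of associative-memory recall (a generic sketch, not the paper's construction): key/value embedding pairs stored as a sum of outer products and read out with a single matrix-vector product.

# Toy associative memory (illustrative only, not the paper's construction):
# store (key, value) embedding pairs as a sum of outer products, then recall
# a value by multiplying the memory matrix with a query key.
import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 256, 50
keys = rng.standard_normal((n_facts, d)) / np.sqrt(d)     # e.g. subject embeddings
values = rng.standard_normal((n_facts, d)) / np.sqrt(d)   # e.g. attribute embeddings

W = values.T @ keys                      # memory = sum_i value_i key_i^T

query = keys[7]                          # look up fact number 7
recalled = W @ query                     # approximately values[7] + crosstalk
scores = values @ recalled
print("recalled fact:", scores.argmax())  # -> 7 when crosstalk is small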
Come build with us and @OpenAI!!
You're in Zurich or its zone of influence (Lausanne, Paris, BXL, Munich, London, ...) and like AI + Robots? We (@openai) together with @mimicrobotics, @lokirobotics, and Zurich Builds are organizing a hackathon from Fri 9 May afternoon to Sun 11. Limited spots, more below:
Announcing: The 2nd International Summer School on Mathematical Aspects of Data Science EPFL, Sept 1–5, 2025 Speakers: Bach (@BachFrancis), Bandeira, Mallat, Montanari (@Andrea__M), Peyré (@gabrielpeyre) For PhD students & early-career researchers Application deadline: May 15
Come by at NeurIPS to hear Hamza present about interesting properties of various feature-learning infinite-parameter limits of transformer models! Poster in Hall A-C #4804 at 11 AM PST Friday. Paper: arxiv.org/abs/2405.15712. Work with @hamzatchaudhry and @CPehlevan
Come by poster #2402 East hall at NeurIPS from 11am-2pm Friday to chat about why outlier features emerge during training and how we can prevent them!
Updated camera ready arxiv.org/abs/2405.19279. New results include:
- non-diagonal preconditioners (SOAP/Shampoo) minimise OFs compared to diagonal (Adam/AdaFactor)
- scaling to 7B params
- our methods for reducing OFs make post-training int8 quantisation (PTQ) easier.
Check it out!
Systematic empirical analysis of the role of feature learning in continual learning using scaling limits theory. Meet Jacopo in Vancouver :)
🎉 Excited to be in #Vancouver next week for #NeurIPS to present results from my Master’s Thesis at the Scalable Continual Learning Workshop on December 14th! 🚀 Our work investigates the role of scale and training regimes in Continual Learning. What did we find? 👇 1/3
Indeed very useful :)
We collected lecture notes and blog posts by group members about recent topics in deep learning theory here. Hope it is useful! pehlevan.seas.harvard.edu/resources-0
Outlier Features (OFs) aka “neurons with big features” emerge in standard transformer training & prevent benefits of quantisation 🥲 but why do OFs appear & which design choices minimise them? Our new work (+@lorenzo_noci @DanielePaliotta @ImanolSchlag T. Hofmann) takes a look 👀🧵
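A minimal sketch of how one might quantify outlier features, assuming a kurtosis-style statistic over per-neuron activation scales; the paper's exact metric may differ.

# Sketch of an outlier-feature diagnostic (the paper's exact metric may differ):
# compute per-neuron RMS activations over a batch and check how heavy-tailed
# they are, e.g. via kurtosis or the max-to-median ratio across neurons.
import torch

def outlier_stats(acts: torch.Tensor):
    """acts: (batch * seq, hidden) activations collected from one layer."""
    neuron_rms = acts.pow(2).mean(dim=0).sqrt()            # scale of each neuron
    z = (neuron_rms - neuron_rms.mean()) / neuron_rms.std()
    kurtosis = (z ** 4).mean().item()                      # ~3 for Gaussian-like scales
    max_to_median = (neuron_rms.max() / neuron_rms.median()).item()
    return kurtosis, max_to_median

# Large kurtosis / max-to-median values mean a few "big" neurons dominate,
# which is exactly what makes post-training int8 quantisation painful.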
I'm also recruiting PhD/MSc students this coming cycle, with an eye towards applications in drug discovery. cs.toronto.edu/~cmaddis/ DM me or email me if you have any questions at all!
My group has multiple openings both for PhD and Post-doc positions to work in the area of optimization for ML, and deep learning theory. We are looking for people with a strong theoretical background (degree in math, theoretical physics or CS with strong theory emphasis).
[1/n] Thrilled that this project with @jzavatoneveth and @cpehlevan is finally out! Our group has spent a lot of time studying high dimensional regression and its connections to scaling laws. All our results follow easily from a single central theorem 🧵 arxiv.org/abs/2405.00592
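As a toy version of the regression-to-scaling-laws connection (a generic sketch under standard power-law assumptions, not the paper's theorem): ridge regression with a power-law feature spectrum already exhibits test error that falls off as a power of the sample size.

# Toy sketch (generic power-law assumptions, not the paper's theorem):
# ridge regression when the feature covariance eigenvalues decay as a power law
# yields population test error that itself decays as a power law in n.
import numpy as np

rng = np.random.default_rng(0)
d = 2000
eigs = np.arange(1, d + 1, dtype=float) ** -1.5     # power-law covariance spectrum
w_star = rng.standard_normal(d) * np.sqrt(eigs)     # target aligned with the spectrum

def test_error(n, lam=1e-3):
    X = rng.standard_normal((n, d)) * np.sqrt(eigs)           # features with cov = diag(eigs)
    y = X @ w_star + 0.1 * rng.standard_normal(n)
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return eigs @ (w_hat - w_star) ** 2                       # population risk (noise-free part)

for n in [100, 400, 1600]:
    print(n, test_error(n))   # error shrinks roughly as a power of n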
From stochastic parrot 🦜 to Clever Hans 🐴? In our work with @_vaishnavh we carefully analyse the debate surrounding next-token prediction and identify a new failure of LLMs due to teacher-forcing 👨🏻‍🎓! Check out our work arxiv.org/abs/2403.06963 and the linked thread!
🗣️ “Next-token predictors can’t plan!” ⚔️ “False! Every distribution is expressible as a product of next-token probabilities!” 🗣️ In work w/ @GregorBachmann1 , we carefully flesh out this emerging, fragmented debate & articulate a key new failure. 🔴 arxiv.org/abs/2403.06963
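To make the teacher-forcing point concrete, a generic sketch (not the paper's setup; `model` here is a hypothetical callable returning per-position logits) of the mismatch between training on ground-truth prefixes and decoding autoregressively.

# Generic sketch of the teacher-forcing gap (not the paper's setup):
# during training the model always conditions on the *ground-truth* prefix,
# while at inference it must condition on its *own* previous predictions.
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, tokens):
    """tokens: (batch, seq). Standard next-token loss with ground-truth prefixes."""
    logits = model(tokens[:, :-1])                 # model sees the true prefix
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )

@torch.no_grad()
def generate(model, prefix, n_steps):
    """Autoregressive decoding: each step conditions on previously *generated* tokens."""
    seq = prefix
    for _ in range(n_steps):
        next_token = model(seq)[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_token], dim=1)  # errors can now compound
    return seq

# Roughly, the failure discussed in the thread: on lookahead/planning-style
# tasks, fitting the teacher-forced objective need not imply that the
# autoregressive loop above succeeds.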