Nicolas Zucchet
@NicolasZucchet
PhD student @CSatETH | prev. student researcher @GoogleDeepMind | @Polytechnique
🧵What if emergence could be explained by learning a specific circuit: sparse attention? Our new work explores this bold hypothesis, showing a link between emergence and sparse attention that reveals how data properties influence when emergence occurs during training.
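The thread doesn't spell out a metric here, but one simple way to track whether a sparse-attention circuit is forming during training is the entropy of the attention weights. A minimal sketch, assuming softmax attention rows; this is an illustrative diagnostic, not the paper's actual analysis:

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    # Entropy of each attention row: near zero when attention is sparse
    # (close to one-hot), near log(num_keys) when it is uniform.
    # attn: (..., num_keys) softmax weights.
    return -np.sum(attn * np.log(attn + eps), axis=-1)

# Example: a near-one-hot row vs. a uniform row over 4 keys.
sparse = np.array([0.97, 0.01, 0.01, 0.01])
uniform = np.full(4, 0.25)
print(attention_entropy(sparse), attention_entropy(uniform))  # ~0.17 vs. ~1.39
```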

Some nice analysis by Nicolas & Francesco of a clear case of emergence — and how to accelerate its acquisition!
How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/
Super excited to host a student researcher together with @oswaldjoh this year! Please sign up if you wanna have some research fun with us :)
We are hosting a student researcher this year on the Paradigms of Intelligence team at Google! If you are interested in working with @ninoscherrer and me on AGI, or whatever you think is the next big thing 🥰, please consider applying! docs.google.com/forms/u/2/d/e/…
Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds on the beautiful idea of 𝗹𝗼𝗰𝗮𝗹𝗹𝘆 𝗼𝗽𝘁𝗶𝗺𝗮𝗹 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 (TTT) and combines in-context learning, test-time training, and mesa-optimization.
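For a rough feel of the core idea, here is a minimal sketch of a mesa-style layer, assuming the layer's output at each step is the prediction of a ridge regression fit to the (key, value) pairs seen so far. The function name and the naive O(T·d³) loop are illustrative; the actual MesaNet uses an efficient recurrent update:

```python
import numpy as np

def mesa_layer(keys, values, queries, lam=1.0):
    # keys, values, queries: (T, d) float arrays.
    # At each step t, fit ridge regression to all (key, value) pairs seen
    # so far, then evaluate it at the current query: a "locally optimal"
    # test-time learner running inside the sequence model.
    T, d = keys.shape
    outputs = np.zeros_like(queries)
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]
        # W = argmin_W ||K W - V||^2 + lam ||W||^2 (closed form)
        W = np.linalg.solve(K.T @ K + lam * np.eye(d), K.T @ V)
        outputs[t] = queries[t] @ W
    return outputs
```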
Emergence in transformers is a real phenomenon! Behaviors and capabilities can appear in models in sudden ways. Emergence is not always just a "mirage". Compiling some examples here (please share any I missed): 🧵
We have a new SSM theory paper, just accepted to COLT, revisiting the recall properties of linear RNNs. It's surprising how deep this topic goes, and how beautiful it becomes. With (and only thanks to) the amazing Alexandre and @BachFrancis arxiv.org/pdf/2502.09287
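As a taste of the recall mechanism in question, here is a minimal sketch of how a linear RNN with a matrix-valued state can implement associative recall. This is standard linear-attention-style recall under the assumption of near-orthonormal keys, not the paper's exact construction:

```python
import numpy as np

def linear_rnn_recall(keys, values, query):
    # Accumulate outer products v k^T in the recurrent state S, then
    # read out with a query; recall is exact when keys are orthonormal.
    S = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        S = S + np.outer(v, k)  # linear state update
    return S @ query

keys = np.eye(3)                        # orthonormal keys
values = np.arange(9.0).reshape(3, 3)   # one value vector per key
print(linear_rnn_recall(keys, values, keys[1]))  # recovers values[1]
```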
Smooth predictable scaling laws are central to our conceptions and forecasts about AI -- but lots of capabilities actually *emerge* in sudden ways. Awesome work by @NicolasZucchet @dngfra bringing more predictability to emergent phenomena, by studying one type: sparse attention
I really like this new op-ed from @DavidDuvenaud on how so many different kinds of pressures could drive toward a loss of human control over AI. It's rare to read anything well written on this topic, but this piece was elegant and smart enough that I wanted to keep reading.
This is just a reminder for your NeurIPS experiments: if you are comparing architectures, optimizers, or whatever at a single hyperparameter setting (e.g., LR), you are automatically not a scientist. You can be better than this. Produce science, not hype.
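A minimal sketch of the protocol the tweet is asking for: tune the learning rate separately for each method, average over seeds, and compare methods at their own best setting. `train_and_eval` is a stand-in for your real training pipeline, not a real API:

```python
import numpy as np

def train_and_eval(method, lr, seed):
    # Placeholder: substitute your actual training run; returns val loss.
    rng = np.random.default_rng(seed)
    return abs(np.log10(lr) + 2.5) + 0.1 * rng.standard_normal()

def best_val_loss(method, lrs, seeds=(0, 1, 2)):
    # Compare methods at their own tuned LR, not at one shared setting.
    return min(
        float(np.mean([train_and_eval(method, lr, s) for s in seeds]))
        for lr in lrs
    )

lrs = np.logspace(-4, -1, 7)
for method in ["baseline", "new_architecture"]:
    print(method, best_val_loss(method, lrs))
```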
Our new paper sheds light on the process of knowledge acquisition in language models, with implications for:
- data curricula
- the challenges of learning new knowledge when fine-tuning
- the emergence of hallucinations
Nicolas did a great job on the project! See his thread👇
Large language models store vast amounts of knowledge, but how exactly do they learn it? Excited to share my @GoogleDeepMind internship results, which reveal the fascinating dynamics behind factual knowledge acquisition in LLMs! arxiv.org/abs/2503.21676
A paper from Google DeepMind sheds light on the knowledge acquisition process of LLMs. Early in training, LLMs go through a plateau in knowledge acquisition. But during this period, the model is actually attending to specific elements and establishing efficient attention patterns for acquiring knowledge; rapid knowledge acquisition then follows. This resembles how young children acquire knowledge.
How LLMs acquire factual knowledge during training remains unclear. This paper investigates these learning dynamics using synthetic biographies, revealing a three-phase process where models first learn generic statistics, plateau while forming attention circuits, and finally acquire factual knowledge rapidly.
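To make the setup concrete, here is a toy sketch of what synthetic-biography training data can look like; all names, attributes, and the template are illustrative assumptions, not the paper's actual dataset:

```python
import random

rng = random.Random(0)
NAMES = ["Ada Meier", "Ben Rossi", "Chloe Dubois"]
# Fixed facts per individual: the (name -> attribute) associations the
# model must memorize during training.
PROFILES = {
    name: {"city": rng.choice(["Zurich", "Paris", "London"]),
           "job": rng.choice(["physicist", "baker", "pilot"])}
    for name in NAMES
}

def biography(name):
    p = PROFILES[name]
    return f"{name} was born in {p['city']} and works as a {p['job']}."

for name in NAMES:
    print(biography(name))
```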