Nolan Dey
@DeyNolan
Research Scientist @ Cerebras Systems
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, enabling quick prototyping of training GPT-style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
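Not the repo's code, just a minimal sketch of the MLX training-step pattern a port like this builds on (the tiny stand-in model, shapes, and hyperparameters below are made up for illustration):

```python
# Minimal MLX training step on a toy next-token model (illustrative only).
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TinyLM(nn.Module):
    """Toy stand-in for a GPT block stack: embedding + linear head."""
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def __call__(self, tokens):
        return self.head(self.embed(tokens))

def loss_fn(model, tokens, targets):
    return nn.losses.cross_entropy(model(tokens), targets, reduction="mean")

model = TinyLM()
optimizer = optim.AdamW(learning_rate=3e-4)
loss_and_grad = nn.value_and_grad(model, loss_fn)

tokens = mx.random.randint(0, 256, (8, 32))    # fake batch: 8 sequences of 32 tokens
targets = mx.random.randint(0, 256, (8, 32))
loss, grads = loss_and_grad(model, tokens, targets)
optimizer.update(model, grads)                 # AdamW step
mx.eval(model.parameters(), optimizer.state)   # force MLX's lazy graph to run on the Mac GPU
```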
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
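The tweet doesn't spell out the rule, so purely as a hedged illustration of the framing this line of work usually uses (the EMA-timescale view of AdamW; the paper's actual prescription and constants may differ):

```python
# Hedged illustration of the AdamW "EMA timescale" framing (not the paper's exact rule;
# all base values are placeholders). Decoupled weight decay makes the weights roughly an
# EMA of recent updates with timescale ~1/(lr * wd) steps, i.e. B/(lr * wd) tokens.

def adamw_ema_fraction(batch_tokens, lr, weight_decay, dataset_tokens):
    """Fraction of the dataset the weights effectively average over."""
    return batch_tokens / (lr * weight_decay * dataset_tokens)

def rescale_weight_decay(base_wd, base_batch, new_batch, base_data, new_data):
    """Keep the EMA fraction fixed as batch size or dataset size changes
    (one reading of 'how weight decay should scale'; see the paper for the real rules)."""
    return base_wd * (new_batch / base_batch) * (base_data / new_data)

# Under this framing, doubling batch size at fixed LR and dataset doubles the weight decay:
print(rescale_weight_decay(base_wd=0.1, base_batch=1_000_000, new_batch=2_000_000,
                           base_data=100e9, new_data=100e9))  # -> 0.2
```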
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…
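Not the paper's code; a hedged sketch of the comparison in the TL;DR, with synthetic low-rank activations standing in for a real layer: reconstruct from the top-k principal components vs. the top-k highest-variance neurons and compare how complete each explanation is.

```python
# Hedged sketch (not the paper's code): completeness of top-k PCs vs. top-k neurons
# on synthetic activations whose structure is spread across many neurons.
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(10_000, 32))                        # 32 underlying features
acts = latents @ rng.normal(size=(32, 512)) \
       + 0.1 * rng.normal(size=(10_000, 512))                  # samples x neurons
acts -= acts.mean(axis=0)
k = 32

# Top-k principal components via SVD of the centered activations.
U, S, Vt = np.linalg.svd(acts, full_matrices=False)
pc_recon = acts @ Vt[:k].T @ Vt[:k]

# Top-k neurons ranked by variance (one simple notion of "most important neurons").
top_neurons = np.argsort(acts.var(axis=0))[::-1][:k]
neuron_recon = np.zeros_like(acts)
neuron_recon[:, top_neurons] = acts[:, top_neurons]

def completeness(recon, full):
    """Fraction of activation variance captured by the explanation."""
    return 1.0 - np.sum((full - recon) ** 2) / np.sum(full ** 2)

print("top-k PCs:    ", completeness(pc_recon, acts))      # close to 1.0 here
print("top-k neurons:", completeness(neuron_recon, acts))  # far lower when features are distributed
```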
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
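As a heavily simplified, hedged sketch of the flavor of scaling rules such a guide covers for Adam-style training (the base values are placeholders and the guide/papers give the full parameterization table):

```python
# Heavily simplified μP-style scaling sketch for Adam (placeholder base values;
# one of several equivalent formulations — see the guide for the full table).

def mup_scaled_hparams(width, base_width=256, base_lr=1e-3, base_std=0.02):
    m = width / base_width  # width multiplier relative to the small tuned proxy model
    return {
        # Embedding / input layers: init std and Adam LR stay at base values.
        "embed":  {"init_std": base_std, "lr": base_lr},
        # Hidden matrix-like layers: init std ~ 1/sqrt(width), Adam LR ~ 1/width.
        "hidden": {"init_std": base_std / m**0.5, "lr": base_lr / m},
        # Output logits: an extra 1/width multiplier keeps them O(1) as width grows.
        "output": {"logit_multiplier": 1.0 / m},
    }

# μTransfer in a nutshell: tune HPs on the width-256 proxy, then reuse them at width 4096.
print(mup_scaled_hparams(width=4096))
```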
So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varying with sparsity level...
(6/n) Applying SμPar to pretraining a 610M parameter LLM significantly improves loss over SP and μP models due to improved HP tuning.
(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
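Hedged sketch of the core idea as described here (not the paper's exact rules): treat a sparse layer's effective fan-in as density × fan-in and apply μP-style corrections to init and LR, so HPs tuned on a small dense proxy stay optimal across width and sparsity.

```python
# Hedged SμPar-style sketch (not the paper's exact parameterization): scale init and
# Adam LR with the layer's *effective* fan-in, i.e. density * fan_in.

def supar_scaled_hparams(fan_in, sparsity, base_fan_in=256, base_lr=1e-3, base_std=0.02):
    density = 1.0 - sparsity
    m = (fan_in * density) / base_fan_in   # effective-width multiplier vs. the dense proxy
    return {
        "init_std": base_std / m**0.5,     # keep activation scale invariant to width/sparsity
        "lr": base_lr / m,                 # keep update scale invariant to width/sparsity
    }

# Same effective width -> same HPs: a 2x wider layer at 50% sparsity matches the dense base.
print(supar_scaled_hparams(fan_in=512, sparsity=0.5))    # ~ {'init_std': 0.02, 'lr': 0.001}
print(supar_scaled_hparams(fan_in=2048, sparsity=0.9))
```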
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
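A hedged sketch of how we read position interpolation for ALiBi (compress position distances so a 16K context stays inside the bias range seen at the 8K training length; the exact method is in the paper):

```python
# Hedged sketch (not the paper's implementation): position interpolation for ALiBi.
# Distances are shrunk by train_len / seq_len so inference at 2x the trained context
# reuses the bias range the model actually saw. Small sizes below keep the demo light;
# the BTLM/MPT case would be train_len=8192, seq_len=16384.
import numpy as np

def alibi_bias(seq_len, n_heads, train_len=None):
    # Standard ALiBi head slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = np.maximum(pos[:, None] - pos[None, :], 0)       # causal distance i - j
    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)                  # position interpolation
    return -slopes[:, None, None] * dist[None, :, :]         # (heads, seq, seq), added to attention scores

bias = alibi_bias(seq_len=1024, n_heads=8, train_len=512)
print(bias.shape, bias.min())   # largest-magnitude bias roughly matches a 512-token context
```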
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using μP - Variable sequence length training + ALiBi - Aggressive LR decay arxiv.org/abs/2309.11568
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵cerebras.net/blog/slimpajam…
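If you just want to poke at the data, a minimal streaming load through 🤗 datasets should look roughly like this (assuming the Hub dataset ID is cerebras/SlimPajama-627B):

```python
# Minimal sketch: stream a few SlimPajama documents instead of downloading the full
# ~627B-token corpus (dataset ID assumed to be "cerebras/SlimPajama-627B").
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])   # each record carries a "text" field plus source metadata
    if i >= 2:
        break
```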
🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look at what it's like to train on Cerebras and the tradeoffs between compute and inference optimal training. youtube.com/watch?v=QmmNgi…