Nolan Dey
@DeyNolan
Research Scientist @ Cerebras Systems
Successfully ported @karpathy's nanoGPT to the new @Apple MLX framework, enabling quick prototyping of training GPT-style models on Mac GPUs. Check out the project: github.com/vithursant/nan…. Got a new M3 Pro and wanted to learn about MLX over the holidays lol.
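Not the repo's code, just a minimal sketch of the MLX training-step pattern a port like this builds on (the tiny stand-in model, shapes, and hyperparameters below are made up for illustration):

```python
# Minimal MLX training step on a toy next-token model (illustrative only).
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TinyLM(nn.Module):
    """Toy stand-in for a GPT block stack: embedding + linear head."""
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def __call__(self, tokens):
        return self.head(self.embed(tokens))

def loss_fn(model, tokens, targets):
    return nn.losses.cross_entropy(model(tokens), targets, reduction="mean")

model = TinyLM()
optimizer = optim.AdamW(learning_rate=3e-4)
loss_and_grad = nn.value_and_grad(model, loss_fn)

tokens = mx.random.randint(0, 256, (8, 32))    # fake batch: 8 sequences of 32 tokens
targets = mx.random.randint(0, 256, (8, 32))
loss, grads = loss_and_grad(model, tokens, targets)
optimizer.update(model, grads)                 # AdamW step
mx.eval(model.parameters(), optimizer.state)   # force MLX's lazy graph to run on the Mac GPU
```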
Power Lines paper now out: arxiv.org/abs/2505.13738 TL;DR - we identify how AdamW's weight decay should scale with batch size, dataset size, and model size in LLM pre-training. We also investigate the scaling of both "optimal" and "critical" batch size.
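The tweet doesn't spell out the rule, so purely as a hedged illustration of the framing this line of work usually uses (the EMA-timescale view of AdamW; the paper's actual prescription and constants may differ):

```python
# Hedged illustration of the AdamW "EMA timescale" framing (not the paper's exact rule;
# all base values are placeholders). Decoupled weight decay makes the weights roughly an
# EMA of recent updates with timescale ~1/(lr * wd) steps, i.e. B/(lr * wd) tokens.

def adamw_ema_fraction(batch_tokens, lr, weight_decay, dataset_tokens):
    """Fraction of the dataset the weights effectively average over."""
    return batch_tokens / (lr * weight_decay * dataset_tokens)

def rescale_weight_decay(base_wd, base_batch, new_batch, base_data, new_data):
    """Keep the EMA fraction fixed as batch size or dataset size changes
    (one reading of 'how weight decay should scale'; see the paper for the real rules)."""
    return base_wd * (new_batch / base_batch) * (base_data / new_data)

# Under this framing, doubling batch size at fixed LR and dataset doubles the weight decay:
print(rescale_weight_decay(base_wd=0.1, base_batch=1_000_000, new_batch=2_000_000,
                           base_data=100e9, new_data=100e9))  # -> 0.2
```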
Published "Neuron-based explanations of neural networks sacrifice completeness and interpretability" in TMLR 2025! TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons. ndey96.github.io/neuron-explana…
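Not the paper's code; a hedged sketch of the comparison in the TL;DR, with synthetic low-rank activations standing in for a real layer: reconstruct from the top-k principal components vs. the top-k highest-variance neurons and compare how complete each explanation is.

```python
# Hedged sketch (not the paper's code): completeness of top-k PCs vs. top-k neurons
# on synthetic activations whose structure is spread across many neurons.
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(10_000, 32))                        # 32 underlying features
acts = latents @ rng.normal(size=(32, 512)) \
       + 0.1 * rng.normal(size=(10_000, 512))                  # samples x neurons
acts -= acts.mean(axis=0)
k = 32

# Top-k principal components via SVD of the centered activations.
U, S, Vt = np.linalg.svd(acts, full_matrices=False)
pc_recon = acts @ Vt[:k].T @ Vt[:k]

# Top-k neurons ranked by variance (one simple notion of "most important neurons").
top_neurons = np.argsort(acts.var(axis=0))[::-1][:k]
neuron_recon = np.zeros_like(acts)
neuron_recon[:, top_neurons] = acts[:, top_neurons]

def completeness(recon, full):
    """Fraction of activation variance captured by the explanation."""
    return 1.0 - np.sum((full - recon) ** 2) / np.sum(full ** 2)

print("top-k PCs:    ", completeness(pc_recon, acts))      # close to 1.0 here
print("top-k neurons:", completeness(neuron_recon, acts))  # far lower when features are distributed
```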
🎉We're excited to announce our joint work with @Cerebras on a new guide to Maximal Update Parameterization (μP) and μTransfer!🎉 This practitioner's guide (and implementation) aims to make μP more accessible and easier to implement for the broader training community. 🧵
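As a heavily simplified, hedged sketch of the flavor of scaling rules such a guide covers for Adam-style training (the base values are placeholders and the guide/papers give the full parameterization table):

```python
# Heavily simplified μP-style scaling sketch for Adam (placeholder base values;
# one of several equivalent formulations — see the guide for the full table).

def mup_scaled_hparams(width, base_width=256, base_lr=1e-3, base_std=0.02):
    m = width / base_width  # width multiplier relative to the small tuned proxy model
    return {
        # Embedding / input layers: init std and Adam LR stay at base values.
        "embed":  {"init_std": base_std, "lr": base_lr},
        # Hidden matrix-like layers: init std ~ 1/sqrt(width), Adam LR ~ 1/width.
        "hidden": {"init_std": base_std / m**0.5, "lr": base_lr / m},
        # Output logits: an extra 1/width multiplier keeps them O(1) as width grows.
        "output": {"logit_multiplier": 1.0 / m},
    }

# μTransfer in a nutshell: tune HPs on the width-256 proxy, then reuse them at width 4096.
print(mup_scaled_hparams(width=4096))
```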
So, uh, it turns out that 30+ years of neural net sparsity research have been confounded by optimal hyperparameters varying with sparsity level...
(6/n) Applying SμPar to pretraining a 610M parameter LLM significantly improves loss over SP and μP models due to improved HP tuning.
(1/n) Paper drop: arxiv.org/abs/2405.15743 TLDR: We introduce the sparse maximal update parameterization (SμPar), which ensures optimal HPs remain the same for any width or sparsity level. This dramatically reduces HP tuning costs, allowing SμPar to achieve superior losses. 🧵 👇
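Hedged sketch of the core idea as described here (not the paper's exact rules): treat a sparse layer's effective fan-in as density × fan-in and apply μP-style corrections to init and LR, so HPs tuned on a small dense proxy stay optimal across width and sparsity.

```python
# Hedged SμPar-style sketch (not the paper's exact parameterization): scale init and
# Adam LR with the layer's *effective* fan-in, i.e. density * fan_in.

def supar_scaled_hparams(fan_in, sparsity, base_fan_in=256, base_lr=1e-3, base_std=0.02):
    density = 1.0 - sparsity
    m = (fan_in * density) / base_fan_in   # effective-width multiplier vs. the dense proxy
    return {
        "init_std": base_std / m**0.5,     # keep activation scale invariant to width/sparsity
        "lr": base_lr / m,                 # keep update scale invariant to width/sparsity
    }

# Same effective width -> same HPs: a 2x wider layer at 50% sparsity matches the dense base.
print(supar_scaled_hparams(fan_in=512, sparsity=0.5))    # ~ {'init_std': 0.02, 'lr': 0.001}
print(supar_scaled_hparams(fan_in=2048, sparsity=0.9))
```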
📣 Paper drop: Position Interpolation Improves ALiBi Extrapolation We found a simple method to 2x the context length of models that use ALiBi. This lets models like BTLM-3B-8K and MPT-7B-8K run high-quality inference at up to 16K context with no additional fine-tuning. 👇
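A hedged sketch of how we read position interpolation for ALiBi (compress position distances so a 16K context stays inside the bias range seen at the 8K training length; the exact method is in the paper):

```python
# Hedged sketch (not the paper's implementation): position interpolation for ALiBi.
# Distances are shrunk by train_len / seq_len so inference at 2x the trained context
# reuses the bias range the model actually saw. Small sizes below keep the demo light;
# the BTLM/MPT case would be train_len=8192, seq_len=16384.
import numpy as np

def alibi_bias(seq_len, n_heads, train_len=None):
    # Standard ALiBi head slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = np.maximum(pos[:, None] - pos[None, :], 0)       # causal distance i - j
    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)                  # position interpolation
    return -slopes[:, None, None] * dist[None, :, :]         # (heads, seq, seq), added to attention scores

bias = alibi_bias(seq_len=1024, n_heads=8, train_len=512)
print(bias.shape, bias.min())   # largest-magnitude bias roughly matches a 512-token context
```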
We just dropped the BTLM-3B-8K paper on arXiv! It distills our recipe for training SOTA LLMs: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using μP - Variable sequence length training + ALiBi - Aggressive LR decay arxiv.org/abs/2309.11568
Cerebras BTLM-3B-8K model crosses 1M downloads🤯 It's the #1 ranked 3B language model on @huggingface! A big thanks to all the devs out there building on top of open source models 🙌
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵cerebras.net/blog/slimpajam…
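If you just want to poke at the data, a minimal streaming load through 🤗 datasets should look roughly like this (assuming the Hub dataset ID is cerebras/SlimPajama-627B):

```python
# Minimal sketch: stream a few SlimPajama documents instead of downloading the full
# ~627B-token corpus (dataset ID assumed to be "cerebras/SlimPajama-627B").
from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])   # each record carries a "text" field plus source metadata
    if i >= 2:
        break
```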
🚨 New podcast: how we made Cerebras-GPT with @DeyNolan and @QuentinAnthon15. A deep look at what it's like to train on Cerebras and the tradeoffs between compute and inference optimal training. youtube.com/watch?v=QmmNgi…