Jascha Sohl-Dickstein
@jaschasd
Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.
My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law sohl-dickstein.github.io/2022/11/06/str… 🧵
This was a fun project! If you could train an LLM over text that has been arithmetically compressed using a smaller LLM as a probabilistic model of the text, the payoff would be large: text would be represented with far fewer tokens, and inference would be much faster and cheaper. The hard part is…
Ever wonder why we don’t train LLMs over highly compressed text? Turns out it’s hard to make it work. Check out our paper for some progress that we’re hoping others can build on. arxiv.org/abs/2404.03626 With @blester125, @hoonkp, @alemi, Jeffrey Pennington, @ada_rob, @jaschasd
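For intuition, here is a minimal sketch of the compression side of that idea: an arithmetic coder whose probability model is a next-token predictor. The `toy_lm_probs` model below is a hypothetical stand-in (a fixed table, not an actual smaller LLM), and exact rational arithmetic is used for clarity rather than the finite-precision machinery a real coder needs; the paper's actual setup differs.

```python
# Toy arithmetic coding with a "language model" as the probability model.
# toy_lm_probs is an illustrative stand-in for a small LM, not the real thing.
from fractions import Fraction
import math

VOCAB = 4  # toy vocabulary of token ids 0..3

def toy_lm_probs(prefix):
    """Stand-in for a small LM: next-token distribution given the prefix."""
    last = prefix[-1] if prefix else 0
    probs = [Fraction(1, 10)] * VOCAB
    probs[last] = Fraction(7, 10)  # peaked: the previous token tends to repeat
    return probs

def encode(tokens):
    """Arithmetic-encode a token sequence into a bitstring using the model's probabilities."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = toy_lm_probs(tokens[:i])
        span = high - low
        cum = sum(probs[:tok])
        low, high = low + span * cum, low + span * (cum + probs[tok])
    # Emit enough bits to name a dyadic interval that sits inside [low, high).
    k = 1
    while Fraction(1, 2 ** k) > (high - low) / 2:
        k += 1
    return format(math.ceil(low * 2 ** k), f"0{k}b")

def decode(bits, n_tokens):
    """Invert encode() by running the same model on the decoded prefix."""
    point = sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits))
    low, high, out = Fraction(0), Fraction(1), []
    for _ in range(n_tokens):
        probs = toy_lm_probs(out)
        span, cum = high - low, Fraction(0)
        for tok, p in enumerate(probs):
            if low + span * cum <= point < low + span * (cum + p):
                low, high = low + span * cum, low + span * (cum + p)
                out.append(tok)
                break
            cum += p
    return out

tokens = [2, 2, 2, 2, 1, 1, 1, 3]
bits = encode(tokens)
assert decode(bits, len(tokens)) == tokens
print(f"{len(tokens)} tokens -> {len(bits)} bits")  # ~1.75 bits/token here vs. 2 for a uniform code
```

Tokens the small model predicts well cost only a fraction of a bit each, which is where the token-count savings would come from.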
I will be attending ICML next week. Reach out (by email) if you'd like to chat! About Anthropic / research / life. I'm especially interested in meeting grad students who can teach me new research ideas.
This is great, hearing Yang's thought process and motivations for his score matching/diffusion research. (I had forgotten that I tried to convince him that score matching was too local to be useful for generative modeling :/)
Very excited to share our interview with @DrYangSong. This is Part 2 of our history of diffusion series — score matching, the SDE/ODE interpretation, consistency models, and more. Enjoy!
Slater is an excellent interviewer. This was a lot of fun to do. I'm even more excited for the upcoming interviews with @DrYangSong and @sedielem !
Very excited to share our interview with @jaschasd on the history of diffusion models — from his original 2015 paper inventing them, to the GAN "ice age", to the resurgence in diffusion starting with DDPM. Enjoy!
This is an excellent paper that ties together many threads around scaling models and hyperparameters.
This was one of the most research-enabling libraries I used at Google. If you want to try out LLM ideas with a simple, clean, JAX codebase, this is for you.
We recently open-sourced a relatively minimal example implementation of Transformer language model training in JAX, called NanoDO. If you stick to vanilla JAX components, the code stays straightforward to read -- the model file is <150 lines. We found it useful as a…
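To give a flavor of the "vanilla JAX components" style (pure functions, a parameter pytree, jax.grad and jax.jit), here is a toy training-step sketch. The model below is a deliberately trivial stand-in, not NanoDO's Transformer; see the repo for the real implementation.

```python
# Toy next-token training loop in plain JAX: params as a dict pytree,
# a pure loss function, and a jitted SGD step.  Illustrative only.
import jax
import jax.numpy as jnp

VOCAB, DIM = 256, 64

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "embed": 0.02 * jax.random.normal(k1, (VOCAB, DIM)),
        "unembed": 0.02 * jax.random.normal(k2, (DIM, VOCAB)),
    }

def loss_fn(params, tokens):
    """Next-token cross-entropy for a toy model (mean of prefix embeddings -> logits)."""
    x = params["embed"][tokens[:, :-1]]                                   # (batch, seq-1, dim)
    h = jnp.cumsum(x, axis=1) / (1 + jnp.arange(x.shape[1]))[None, :, None]
    logits = h @ params["unembed"]                                        # (batch, seq-1, vocab)
    logp = jax.nn.log_softmax(logits, axis=-1)
    targets = tokens[:, 1:]
    nll = -jnp.take_along_axis(logp, targets[..., None], axis=-1)
    return nll.mean()

@jax.jit
def train_step(params, tokens, lr=1e-2):
    loss, grads = jax.value_and_grad(loss_fn)(params, tokens)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

key = jax.random.PRNGKey(0)
params = init_params(key)
tokens = jax.random.randint(key, (8, 33), 0, VOCAB)  # fake batch of token ids
for _ in range(3):
    params, loss = train_step(params, tokens)
print("loss:", float(loss))
```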
Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations, but we can have fun testing them as speed optimizations by running at an (overly costly) low batch size. Come work with me at Anthropic on things like this -- more info in the thread 🧵
I’ve been daydreaming about an AI+audio product that I think recently became possible: virtual noise canceling headphones. I hate loud background noise -- BART trains, airline cabins, road noise, ... 🙉. I would buy the heck out of this product, and would love it if it were built…
An excellent project making evolution strategies much more efficient for computing gradients in dynamical systems.
📝Quiz time: when you have an unrolled computation graph (see figure below), how would you compute gradients with respect to the unrolling parameters? If your answer only contains backprop, it's time to add a new method to your gradient-estimation toolbox!
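As a concrete example of such a method, here is a rough sketch of an antithetic evolution-strategies (ES) gradient estimator applied to a toy unrolled dynamical system, next to backprop through the unroll. The system, sigma, and sample count are illustrative placeholders, not the setup from the linked work.

```python
# Antithetic ES gradient estimate for an unrolled computation, vs. backprop.
# ES estimates the gradient of a Gaussian-smoothed loss without differentiating
# through the unroll, which helps when the unrolled loss surface is ill-behaved.
import jax
import jax.numpy as jnp

def unrolled_loss(theta, x0=1.0, steps=50):
    """Loss of a simple dynamical system unrolled for `steps` iterations."""
    x = x0
    for _ in range(steps):
        x = jnp.tanh(theta[0] * x + theta[1])  # one unroll step
    return (x - 0.5) ** 2                      # match a target final state

def es_grad(theta, key, sigma=0.05, n_pairs=256):
    """Antithetic ES estimator: E[(L(θ+σε) − L(θ−σε)) ε / (2σ)]."""
    eps = jax.random.normal(key, (n_pairs, theta.shape[0]))
    loss_plus = jax.vmap(lambda e: unrolled_loss(theta + sigma * e))(eps)
    loss_minus = jax.vmap(lambda e: unrolled_loss(theta - sigma * e))(eps)
    return jnp.mean((loss_plus - loss_minus)[:, None] * eps, axis=0) / (2 * sigma)

theta = jnp.array([0.8, 0.1])
g_bp = jax.grad(unrolled_loss)(theta)         # backprop through the unroll
g_es = es_grad(theta, jax.random.PRNGKey(0))  # smoothed, derivative-free estimate
print("backprop:", g_bp, " ES estimate:", g_es)
```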
2+2=5? “LLMs are not Robust to Adversarial Arithmetic” is a new paper from our team at @GoogleDeepMind, with @bucketofkets, @culpla, @AlwaysParisi, @gamaleldinfe, @jaschasd, and Noah Fiedel. TL;DR: We ask an LLM to attack itself, and find that this works extremely well.