Konstantin Mishchenko
@konstmish
Research Scientist @AIatMeta · Previously Researcher @ Samsung AI · Outstanding Paper Award @icmlconf 2023 · Action Editor @TmlrOrg · I tweet about ML papers and math
A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it'd be worth sharing it here too: 1. Adaptive optimization. There has been a lot going on in the last year, below are some…
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
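A minimal sketch of what that comparison is about, just the two update rules side by side (hyperparameter values below are illustrative placeholders, not the settings from the paper):

import numpy as np

def sgd_step(x, grad, lr=0.02):
    # Plain SGD without momentum: the entire update is one line.
    return x - lr * grad

def adamw_step(x, grad, state, lr=1e-3, betas=(0.9, 0.95), eps=1e-8, wd=0.1):
    # AdamW for contrast: per-coordinate first/second moments, bias correction,
    # and decoupled weight decay.
    t = state.get("t", 0) + 1
    m = betas[0] * state.get("m", np.zeros_like(x)) + (1 - betas[0]) * grad
    v = betas[1] * state.get("v", np.zeros_like(x)) + (1 - betas[1]) * grad**2
    state.update(m=m, v=v, t=t)
    m_hat = m / (1 - betas[0]**t)
    v_hat = v / (1 - betas[1]**t)
    return x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)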
We just released the best 3B model, 100% open-source: open dataset, architecture details, exact data mixtures, and the full training recipe including pre-training, mid-training, post-training, and synthetic data generation, so anyone can train their own. Let's go open-source AI!
Introducing SmolLM3: a strong, smol reasoner!
> SoTA 3B model
> dual mode reasoning (think/no_think)
> long context, up to 128k
> multilingual: en, fr, es, de, it, pt
> fully open source (data, code, recipes)
huggingface.co/blog/smollm3
The thing I like about AdaGrad is not that it estimates the learning rates for each coordinate, but rather that it uses the same stepsize no matter the function class that you minimize. If you use gradient descent, you have to know either L from L-smoothness or an upper bound G on…
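A rough sketch of the contrast, using the diagonal (per-coordinate) AdaGrad variant; eta plays the role of the problem-independent constant, while gradient descent's stepsize has to encode L (or G in the nonsmooth case):

import numpy as np

def gd_step(x, grad, L):
    # Classic gradient descent: the 1/L stepsize requires knowing the smoothness
    # constant L (for nonsmooth problems you instead need a gradient bound G).
    return x - grad / L

def adagrad_step(x, grad, state, eta=1.0):
    # AdaGrad: accumulate squared gradients per coordinate and divide by their
    # square root; the same eta can be used regardless of the function class.
    state["sum_sq"] = state.get("sum_sq", np.zeros_like(x)) + grad**2
    return x - eta * grad / (np.sqrt(state["sum_sq"]) + 1e-12)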
