Vlado Boza
@bozavlado
second of his name. Destroyer of ML hype. I also enjoy making neural networks smaller. http://kaggle.com/usamec
One would think that Adafactor is just RMSProp with a rank-1 factorized scaling factor. But no. That thing has an LR scheduler, clipping, and scaling by parameter norm built in. And thus, no surprise, replacing PyTorch Adafactor with TIMM Adafactor leads to a total mess. Can…
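For context, here is a minimal sketch of the extra machinery layered on top of the factored second moment, assuming the behavior described in the original Adafactor paper (which TIMM's version follows more closely than torch.optim's). Function name, constants, and normalization details are illustrative, not taken from either implementation.

```python
import torch

def adafactor_like_update(param, grad, row_var, col_var, step,
                          eps=1e-30, clip_threshold=1.0, decay=0.8):
    # 1) Rank-1 factored second moment: the part people usually remember.
    beta2 = 1.0 - step ** (-decay)
    row_var.mul_(beta2).add_((grad ** 2 + eps).mean(dim=1), alpha=1 - beta2)
    col_var.mul_(beta2).add_((grad ** 2 + eps).mean(dim=0), alpha=1 - beta2)
    denom = (row_var.unsqueeze(1) / row_var.mean()) * col_var.unsqueeze(0)
    update = grad / denom.sqrt()

    # 2) Built-in clipping of the update by its RMS.
    rms = update.pow(2).mean().sqrt()
    update = update / torch.clamp(rms / clip_threshold, min=1.0)

    # 3) Built-in "LR schedule": a relative step size decaying with step count...
    rel_step = min(1e-2, 1.0 / (step ** 0.5))
    # 4) ...scaled by the RMS of the parameter itself.
    lr = rel_step * max(1e-3, param.pow(2).mean().sqrt().item())
    param.add_(update, alpha=-lr)
```

Swapping implementations silently swaps all of 2)-4) as well, not just the factored scaling, which is where the "total mess" tends to come from.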
One of the comments is equivalent to: "Your work is not novel. You are solving the graph 3-coloring problem, but [1,2,3,4] already solved the graph 2-coloring problem".
Got NeurIPS reviews. I want to start 3 out of 4 rebuttals with "You fucking idiot..."
NVIDIA driver and CUDA installs on Linux run better now than in 2015. Granted, the bar was low. But the last time I installed them, everything ran fine. That was a first. No need to mess with GRUB anymore.
I actually can't think of 1 tech product that works or runs better today than it did in 2015. They have all gotten worse right down to search bars.
This HRM thing is essentially a Perceiver (see image) with some more bells and whistles.
🚀Introducing Hierarchical Reasoning Model🧠🤖 Inspired by the brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT! Unlock next AI breakthrough with…
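For reference, a minimal sketch of the Perceiver-style pattern being pointed at: a small set of learned latents repeatedly cross-attends to the input tokens and is refined by self-attention. Module names, sizes, and depth here are illustrative, not taken from the HRM code.

```python
import torch
import torch.nn as nn

class PerceiverLikeBlock(nn.Module):
    """Illustrative Perceiver-style block: latents cross-attend to inputs,
    then self-attend. Sizes are arbitrary, not HRM's."""
    def __init__(self, dim=256, num_latents=64, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, inputs, n_iters=4):
        # inputs: (batch, seq_len, dim)
        z = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        for _ in range(n_iters):
            # Latents read from the (larger) input via cross-attention...
            z = z + self.cross_attn(z, inputs, inputs)[0]
            # ...then refine themselves with self-attention and an MLP.
            z = z + self.self_attn(z, z, z)[0]
            z = z + self.ff(z)
        return z
```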
torch.compile is a mess even in single-GPU training, when you are not doing completely typical training.
AI researchers when they discovered that torch.compile doesn't scale well to real multi-node production training workloads and is a giant footgun
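A toy illustration of the kind of "not completely typical" pattern that tends to trip it up: data-dependent Python control flow usually forces graph breaks and recompilations under torch.compile. The function below is hypothetical, just a sketch of the failure mode.

```python
import torch

def step(x, w):
    y = x @ w
    # Branching on a tensor value is where torch.compile stops being a
    # transparent speedup: Dynamo cannot trace through this cleanly and
    # falls back to eager around the branch (a graph break).
    if y.mean() > 0:
        y = torch.relu(y)
    else:
        y = torch.tanh(y)
    return y

compiled_step = torch.compile(step)
x = torch.randn(8, 16)
w = torch.randn(16, 16)
out = compiled_step(x, w)  # still runs, but with graph breaks under the hood
```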
Back in the day, going from float32 to float64 meant that your GPU computation would be slow as hell. Now it is the same with going from bfloat16 to float32.
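A quick way to see this for yourself is a rough matmul micro-benchmark like the sketch below (requires a CUDA GPU; exact ratios depend on the hardware and on whether TF32 is enabled):

```python
import time
import torch

def time_matmul(dtype, n=8192, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# On recent GPUs, tensor cores give bfloat16 a large throughput edge over
# plain float32, much like float32 vs float64 on older hardware.
print("bf16:", time_matmul(torch.bfloat16))
print("fp32:", time_matmul(torch.float32))
```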
I do model compression and optimization. It is essential to have access to different GPUs, and that would be impossible without @vast_ai. Happy to finally meet you guys at #ICML2025. And thanks a lot for the Nintendo Switch!
The EU AI Act presumes a general-purpose AI model poses systemic risk once its training compute exceeds 1e25 FLOPs. What a crappy piece of legislation that thing is...
True, the first-ever application of Muon was to break the 3-second barrier in the CIFAR-10 speedrun. For perspective on scale, that was a 3e14 FLOP training run; @Kimi_Moonshot's K2 is 3e24 FLOPs, 10 orders of magnitude larger. x.com/_arohan_/statu…
RL is really sample efficient. We ran a small experiment on GeoGuessr. With just 16 images per country, Moondream performs as well as Claude Sonnet. With the full dataset, it beats Sonnet by a decent margin while being orders of magnitude cheaper to run.