tender
@tenderizzation
PRs reverted world champion
DM me "hey" I'll debug your CUDA error: an illegal memory access🍒 "hi" I'll debug your cuDNN error: CUDNN_STATUS_BAD_PARAM 🍑 "howdy" I'll debug your CUDA error: CUBLAS_STATUS_EXECUTION_FAILED 🍓
people will say this, and then you make the default config more optimal only to find out it's broken on a GeForce 1050 Ti when the horde of Windows huggingface/diffusers users descends on your GitHub project
running some workload on torchtitan with the default config: 23% MFU. change the config a bit, enable compile and flex, increase the batch size a bit --> 58% MFU. I wish we had more performant defaults in the torch ecosystem
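the two toggles doing most of the lifting there, as a minimal sketch (made-up module and shapes, not the actual torchtitan config; assumes a CUDA GPU and a recent PyTorch):

    # sketch: torch.compile + FlexAttention, the two switches mentioned above
    import torch
    from torch.nn.attention.flex_attention import flex_attention

    class TinyAttention(torch.nn.Module):
        def forward(self, q, k, v):
            # flex_attention fuses into a single kernel under torch.compile
            return flex_attention(q, k, v)

    model = torch.compile(TinyAttention())
    q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
    out = model(q, k, v)  # shape (2, 8, 1024, 64)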
the corollary to this is that, until gdm poaches some high-profile folks, the cost to convince someone to switch from pytorch to JAX is unbounded
we can only surmise that it costs $100M to poach an openai researcher but $1B to poach a gdm researcher because that’s how much it costs to convince someone to switch from JAX to pytorch
world class AI researcher explaining why their FP4 quantization-aware training recipe is the only one that prevents the model from diverging
"the patient needs horse piano to live"
Tri Dao beating one of his own kernels with RL
what if Tri Dao is one of the co-authors😏
if you want to build this you can just start from the code generated by the torch.compile Inductor C++ wrapper and remove dependencies until there are none left </shitpost>
It's honestly incomprehensible to me that we haven't started writing training solutions like game engines. Stable, well-designed abstractions, in a clean zero-dependency C++ project. You know game engines also just build GPU command buffers, right?
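if anyone actually takes the bait, getting at that generated Inductor code is easy; a sketch (the toy function is made up, and the cpp_wrapper / output_code knobs vary a bit across PyTorch versions):

    # sketch: dump what Inductor generates, including its C++ wrapper
    # run with TORCH_LOGS="output_code" to print the generated source
    import torch
    import torch._inductor.config as inductor_config

    inductor_config.cpp_wrapper = True  # request the C++ wrapper codegen path

    @torch.compile
    def toy(x):
        return torch.nn.functional.relu(x) * 2.0

    toy(torch.randn(1024, device="cuda"))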
to be fair you will see the same kind of stuff at Regeneron (rip Intel) STS and ISEF
The children yearn to be working in fabs
Taiwan's high school science exhibition this year is discussing 1.5nm Gate-All-Around transistor structure optimization. The kids are unbelievably cracked
contributors adding a new feature and then realizing it has to be compatible with all previous use-cases, including bugs
Aaand it’s reverted (hopefully back in soon though)
on the heels of the IMO gold announcement I would like to point out that native FusedRMSNorm by @Norapom04 was finally merged into pytorch (and not reverted yet 🤞). torch.nn.RMSNorm will automatically pick the faster path with no user code changes: multiple times speedup over previous…
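the user-facing side really is just this (a minimal sketch; needs a recent PyTorch, and whether the fused kernel is actually hit depends on your build):

    # sketch: torch.nn.RMSNorm usage; the faster path is picked internally
    import torch

    norm = torch.nn.RMSNorm(normalized_shape=4096, eps=1e-6).cuda()
    x = torch.randn(8, 4096, device="cuda")
    y = norm(x)  # same call as before, faster kernel underneath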
back in my day our pile of S-curves was called Moore's Law, and it lasted four decades: CMOS, immersion lithography, DUV, EUV, multi-patterning, FinFET, just to name a few. But it doesn't seem like we need four decades of sustained AI progress for it to become self-sustaining
What seems like an exponential in AI is just a series of S curves. Each era rides on a wave of increasing compute but finds a new way to utilise it - overcoming limitations of the previous stage. Eg pre-training was the dominant way to utilise compute, but the limitations of…
the NaN loss making its way from a worker GPU register through global memory, host memory, CPU registers, the NIC, TCP/IP, the wandb server's NIC, its system memory, ..., TCP/IP back to some web browser, just so the researcher can kill the experiment
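the avoidable version, as a sketch (names are made up, not anyone's actual training loop):

    # sketch: kill the run at the source instead of round-tripping the NaN
    import math
    import torch

    def guard(loss: torch.Tensor, step: int) -> None:
        # .item() forces a device sync, so call this every N steps if that matters
        if not math.isfinite(loss.item()):
            raise RuntimeError(f"non-finite loss at step {step}, aborting locally")

    guard(torch.tensor(1.5), step=0)       # passes
    # guard(torch.tensor(float("nan")), 1) # would raise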