David (drbh) Holtz
@justdrbh
Imaginary numbers are real. ML Ops at 🤗
Excited to share the Kernel Hub: optimized CUDA kernels, plug-and-play from the Hugging Face Hub. No boilerplate, just speed. huggingface.co/blog/hello-hf-…

Luminal can discover flash attention entirely automatically. We've been working towards this north star in our search compiler. Check out the prototype demo below ↓
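The rewrite the search has to rediscover is, at its core, the online-softmax trick behind flash attention: stream over blocks of K/V with a running max and running denominator, so the full score matrix is never materialized. A minimal NumPy sketch of that idea (illustrative only, not Luminal's actual compiler output):

```python
import numpy as np

def attention_reference(q, k, v):
    """Naive attention: materializes the full (n, n) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def attention_tiled(q, k, v, block=16):
    """Flash-attention-style tiling: visit K/V in blocks, keeping a
    running row max m and running softmax denominator l (online softmax)."""
    n, d = q.shape
    o = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row max
    l = np.zeros((n, 1))           # running softmax denominator
    for j in range(0, k.shape[0], block):
        s = q @ k[j:j + block].T / np.sqrt(d)        # scores for this block
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)                    # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ v[j:j + block]
        m = m_new
    return o / l
```

The tiled version touches only a `block`-wide slice of K and V at a time, which is exactly the memory-access pattern that makes flash attention fast on GPUs.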
Here's the world's first AMD GPU driven over USB3. From a Mac! Linux and Windows should work too, it's just libusb. Available today in tinygrad master; use an ADT-UT3G to connect the GPU to your USB port. You have no idea of the level of engineering that went into this.
Chip specs: 1 FP16 PFLOP from
- 256 tinycores (VLIW, in-order)
- 128 KB local SRAM each (L1)
- 1024-bit datapath
- 8x8 * 8x8 = 8x8 tensor core
- dual-issue ALU
- @ 2 GHz
- flexible DMA engines for L2 -> L1
- open-source, silicon-verified HDL
- 128 MB global 5 TB/s SRAM (L2)
- 512-bit…
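The "8x8 * 8x8 = 8x8 tensor core" line describes a fixed-size matmul primitive; larger GEMMs are built by accumulating over such tiles. A toy NumPy model of that decomposition (a sketch of the idea, not the chip's actual ISA):

```python
import numpy as np

TILE = 8  # the tensor core multiplies one 8x8 tile by another per op

def tensorcore_mma(a_tile, b_tile, acc):
    """One tensor-core op: acc += a_tile @ b_tile, all operands 8x8."""
    return acc + a_tile @ b_tile

def tiled_matmul(a, b):
    """Build an (M, N) GEMM out of 8x8 tensor-core ops.
    Assumes all dimensions are multiples of TILE."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE))
            for p in range(0, k, TILE):   # accumulate along the K dimension
                acc = tensorcore_mma(a[i:i+TILE, p:p+TILE],
                                     b[p:p+TILE, j:j+TILE], acc)
            c[i:i+TILE, j:j+TILE] = acc
    return c
```

The DMA engines in the spec exist to keep those 8x8 tiles flowing from L2 into each core's 128 KB L1 fast enough to saturate the tensor cores.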
If you can quantize your model maybe you didn't train it enough.
Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics…
TGI v3 is here: 3x more tokens, 13x faster than vLLM on long prompts, and zero config to get started. If you’re working with large inputs or need serious speed, check it out! 🤗 repo: github.com/huggingface/te… docs: huggingface.co/docs/text-gene… hugging chat: huggingface.co/chat/
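"Zero config" on the client side too: a running TGI server exposes a `/generate` endpoint that takes a plain JSON payload. A stdlib-only sketch of building such a request (the URL is a placeholder for wherever your server is running):

```python
import json
import urllib.request

# Placeholder: point this at your own TGI deployment.
TGI_URL = "http://localhost:3000/generate"

def build_request(prompt, max_new_tokens=64):
    """Build a request for TGI's /generate endpoint:
    {"inputs": ..., "parameters": {...}} as JSON."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        TGI_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("What is Deep Learning?")
# Send with urllib.request.urlopen(req) once a server is up.
```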

Introducing Willow, our new state-of-the-art quantum computing chip with a breakthrough that can reduce errors exponentially as we scale up using more qubits, cracking a 30-year challenge in the field. In benchmark tests, Willow solved a standard computation in <5 mins that would…
This is (good) modern art
Only 15% of people believed that a real HTTP server could be done in under 200 LOC of assembly. Here is my macOS ARM assembly version that includes:
- primitive routing
- real configuration (e.g., the port is not hardcoded)
- a lot of comments
and still is under 200 LOC despite a…
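For reference, the loop the assembly has to implement by hand with raw syscalls is just socket/bind/listen/accept, read the request line, write a response. A minimal Python sketch of that same loop (a toy single-request server, not the assembly code itself):

```python
import socket
import threading

def serve_once(port=0):
    """Minimal HTTP server: one accept, primitive routing on the
    request line, one response. port=0 lets the OS pick (configurable,
    not hardcoded). Returns the bound port."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    chosen = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        request = conn.recv(4096).decode()
        path = request.split(" ", 2)[1]      # "GET /path HTTP/1.1" -> "/path"
        body = "hello" if path == "/" else "not found"
        status = "200 OK" if path == "/" else "404 Not Found"
        conn.sendall(
            f"HTTP/1.1 {status}\r\nContent-Length: {len(body)}\r\n\r\n{body}".encode()
        )
        conn.close()
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return chosen
```

Every line here maps to one or two syscalls, which is why 200 LOC of assembly is plenty.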
Releasing INTELLECT-1: We’re open-sourcing the first decentralized trained 10B model: - INTELLECT-1 base model & intermediate checkpoints - Pre-training dataset - Post-trained instruct models by @arcee_ai - PRIME training framework - Technical paper with all details
🚨 New post 🚨 In my latest post, we iteratively improve on positional encoding schemes to discover RoPE entirely independently!
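The scheme the post converges on can be stated compactly: rotate each (even, odd) dimension pair of a vector at position `pos` by `pos * theta_i`, so attention dot products depend only on relative offsets. A small NumPy sketch of RoPE (a reference implementation of the standard formulation, not the post's code):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding: rotate each (even, odd) dimension
    pair of x[pos] by pos * theta_i, with theta_i = base**(-2i/d)."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                   # (n, 1) positions
    theta = base ** (-np.arange(0, d, 2) / d)     # (d/2,) frequencies
    ang = pos * theta                             # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The payoff is the relative-position property: because rotations are orthogonal, the dot product between a query at position i and a key at position j depends only on i - j, not on the absolute positions.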
Our Llama 3.1 405B is now openly available! After a year of dedicated effort, from project planning to launch reviews, we are thrilled to open-source the Llama 3 herd of models and share our findings through the paper: 🔹Llama 3.1 405B, continuously trained with a 128K context…
Runway’s AI Film Festival was awesome! Really cool to see how many different kinds of videos are being made (animated, photorealistic, a combination of both). Can’t wait to see what else gets made this year!
