toil toil toil
Llama 4 analysis v1:
1. Maverick mixes MoE & dense layers - every odd layer is MoE
2. Scout uses L2 Norm on QK (not QK Norm)
3. Both route each token to 1 expert (n_experts_per_token = 1)
4. Official repo uses torch.bmm (not efficient)
5. Maverick layers 1, 3, 45 MoE are "special" layers
6. 8192-token chunked attention
Details: 1.…
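On point 2, a minimal sketch of the distinction: plain L2 normalization of the query/key vectors, as opposed to a learned RMSNorm-style "QK-Norm". The shapes, eps, and function name here are illustrative, not the official implementation.

```python
import torch

def l2_norm_qk(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """L2-normalize query/key vectors along the head dimension.

    This is plain L2 normalization with no learnable gain, unlike the usual
    "QK-Norm", which applies an RMSNorm/LayerNorm with trainable weights.
    Assumed shapes: (batch, heads, seq, head_dim).
    """
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    return q, k
```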
You can now run Llama 4 on your local device!🦙 We shrank Maverick (402B) from 400GB to 122GB (-70%). Scout: 115GB to 33.8GB (-75%) Our Dynamic 1.78bit GGUFs ensure optimal accuracy by selectively quantizing layers GGUFs: huggingface.co/collections/un… Guide: docs.unsloth.ai/basics/tutoria…
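For anyone who has not run a GGUF locally before, a minimal sketch with llama-cpp-python is below. The model filename, context length, and offload settings are placeholders; substitute whichever quant shard you actually downloaded from the collection above.

```python
# pip install llama-cpp-python  (build with GPU support if you want offload)
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-GGUF-Q2_K.gguf",  # hypothetical local path to your downloaded quant
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU; set 0 for CPU-only
)

out = llm("Explain mixture-of-experts in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```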
A few weeks ago, @Grad62304977 pointed me towards PaLM's beta2 schedule. While improving convergence across domains, it also requires changes in Adam's algorithm. Today, I'm open-sourcing PaLM-Adam. Give it a try! gist.github.com/ClashLuke/9a00… x.com/Grad62304977/s…
Thanks! Have u tried the palm beta2 schedule? 1 - k^(-0.8) where k is the step
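A minimal sketch of that schedule dropped into a stock PyTorch Adam. The linked gist implements this properly inside the optimizer; note that torch's built-in bias correction still assumes a constant beta2, so this loop is only an approximation. Model, lr, and step count are illustrative.

```python
import torch

def palm_beta2(step: int, power: float = 0.8) -> float:
    """PaLM-style second-moment decay: beta2 = 1 - k^(-0.8) at step k (1-indexed)."""
    return 1.0 - step ** (-power)

model = torch.nn.Linear(64, 64)
opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.99))

for step in range(1, 1001):
    loss = model(torch.randn(8, 64)).square().mean()
    loss.backward()
    # Adam reads betas from the param group on every step, so the schedule
    # can be applied by rewriting beta2 just before opt.step().
    for group in opt.param_groups:
        group["betas"] = (0.9, palm_beta2(step))
    opt.step()
    opt.zero_grad()
```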
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
The kind folks from @modal_labs have just shared with me this 10-100x faster drop-in replacement for pip github.com/astral-sh/uv If you want a much faster CI startup switch to uv now! To use you just add `uv` before `pip` and everything else is the same, so: pip install uv uv…
Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.
We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts.…
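For intuition, here is a toy TopK sparse autoencoder in the spirit of the two posts above. The sizes, k, and the plain MSE objective are illustrative stand-ins, not the released training stack or the 16M-latent run.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Toy sparse autoencoder with a TopK activation.

    Encode activations into a much wider latent space, keep only the k
    largest latents per example, and reconstruct the input.
    """
    def __init__(self, d_model: int = 768, n_latents: int = 32768, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model, bias=False)
        self.b_pre = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        pre = self.encoder(x - self.b_pre)
        # Keep only the top-k latent activations per example; zero out the rest.
        vals, idx = pre.topk(self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
        recon = self.decoder(latents) + self.b_pre
        return recon, latents

sae = TopKSAE()
x = torch.randn(16, 768)            # stand-in for model activations
recon, latents = sae(x)
mse = (recon - x).square().mean()   # reconstruction loss; L0 is exactly k by construction
```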
Just got the results!!! MMDiT 🤝muP. infinite width never disappoints 🫡 @TheGregYang Gradient norm: never blows up, Loss : never spikes, any scale! Feature updates: Maximal🌊🌊 The code to reproduce this -> github.com/cloneofsimo/mi…
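A rough sketch of how muP is typically wired up with the `mup` package, for readers unfamiliar with it. The MLP stand-in, widths, and optimizer settings are illustrative assumptions; the MMDiT run above uses the linked repo, not this, and a real setup would also re-initialize weights with mup's init helpers.

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

def make_mlp(width: int) -> nn.Sequential:
    # MuReadout replaces the final Linear so the output scale stays width-independent.
    return nn.Sequential(
        nn.Linear(256, width),
        nn.GELU(),
        nn.Linear(width, width),
        nn.GELU(),
        MuReadout(width, 10),
    )

base  = make_mlp(width=64)    # narrow "base" model defines the parametrization
delta = make_mlp(width=128)   # a second width so mup can infer which dims scale
model = make_mlp(width=4096)  # the model you actually train

set_base_shapes(model, base, delta=delta)
opt = MuAdam(model.parameters(), lr=1e-3)  # width-aware per-parameter LR scaling
```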
ancient 2017 laptop (XPS 15 9560) at various amperage limits (-125mV offset), cause why not

magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/OdtBUsbeV5%3A1337%2Fannounce
Announcing Stable Diffusion 3, our most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities. Today, we are opening the waitlist for early preview. This phase…
FlashAttention v2.5.5 now supports head dim 256 backward on consumer GPUs. Hope that makes it easier to finetune Gemma models
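A minimal sketch of exercising that path with the flash-attn package. Shapes and dtype are illustrative; it assumes a CUDA GPU and fp16/bf16 tensors.

```python
import torch
from flash_attn import flash_attn_func

# Gemma-style head_dim = 256, the case that now has a backward pass on consumer GPUs.
batch, seqlen, n_heads, head_dim = 2, 1024, 8, 256
q = torch.randn(batch, seqlen, n_heads, head_dim,
                device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, n_heads, head_dim)
out.sum().backward()                         # backward through head_dim 256
```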
magnet:?xt=urn:btih:7b212968cbf47b8ebcd7017a1e41ac20bf335311&xt=urn:btmh:122043d0d1a79eb31508aacdfe2e237b702f280e6b2a1c121b39763bfecd7268a62d&dn=ai2-model release 49c8647f439c324f564651c83bd945c0140c2750 err not sure you should get models like this but enjoy
1. Take pretrained LLMs 2. Prompt with "3.14159265358979323846" 3. ??? (circle size == pretraining tokens)
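A rough sketch of reproducing this probe with transformers. The model name, greedy decoding, and character-level scoring are assumptions, not the original setup; swap in whichever pretrained LLM you want to test.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PI_PROMPT = "3.14159265358979323846"  # the prompt from the post
TRUE_PI = "3.141592653589793238462643383279502884197169399375105820974944"

name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok(PI_PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
completion = tok.decode(out[0], skip_special_tokens=True)

# Count how many characters of pi the model continues correctly.
correct = 0
for a, b in zip(completion, TRUE_PI):
    if a != b:
        break
    correct += 1
print(f"{name}: first mismatch after {correct} characters of pi")
```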

Graded: x.com/keirp1/status/…
Following up on my previous post, I hand-graded held-out* math exams from the recently released Qwen 72B and DeepSeek 67B Base/Chat. It seems like they perform similarly to Claude 2!
DeepSeek 67B: 37%
GPT-3.5: 41%
Qwen 72B: 52%
Claude 2: 55%
DeepSeek 67B Chat: 56%
Grok-1: 59%…