BlinkDL
@BlinkDL_AI
RWKV = 100% RNN with GPT-level performance. https://lfaidata.foundation/projects/rwkv and https://github.com/search?o=desc&q=rwkv&s=updated&type=Repositories
RNN+Pretrain+Scaling is all you need. Introducing RWKV-7 G0 🪿 7.2B, the strongest pure RNN reasoning model (can self-correct math mistakes). Download & Details: github.com/BlinkDL/RWKV-L… and it's only +2T tokens - I am training stronger RNNs🙂
RWKV7-G1 "GooseOne" 🪿 2.9B release: pure RNN (attention-free) reasoning model, +5.2T tokens, comparable with Qwen2.5 3B / Llama3.2 3B and fully multilingual. Chat demo & weights on RWKV.com 7B training in progress.
Songlin blocked me on X and banned me from the FLA discord. I guess she truly wants her side of the story to be the only one that's kept 🙃 You can't change history, can you?
So now Songlin is mad. It began when I saw an obviously wrong MQAR result for RWKV-7 posted by Songlin (see x.com/BlinkDL_AI/sta……). I told Songlin to use RWKV-LM, and got a very fierce reply in the official FLA group. Songlin pinned the personal attack for several days 🙃
So I tested some AIME problems (including a modified AIME 2025 question to detect memorization), and it's quite amazing that a pure RNN can solve the easy ones. So an RNN getting IMO gold is certainly possible after further scaling 🤣
p.s. I think arXiv papers can be the next source of reasoning data: (1) Locate difficult yet predictable tokens (2) Use them for RL (3) "Solving" papers will be more than enough to solve the badly-named "Humanity's Last Exam"🙂
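A minimal sketch of one way to read step (1), "locate difficult yet predictable tokens": treat a token as difficult if a small policy model assigns it low probability, and predictable if a stronger reference model assigns it high probability. This interpretation, the thresholds, and the HuggingFace-style `policy`/`reference` models returning `.logits` are all assumptions for illustration, not the author's recipe.

```python
import torch

# Hedged sketch of step (1): "locate difficult yet predictable tokens".
# Assumption: policy/reference are HuggingFace-style causal LMs whose
# forward pass returns an object with a .logits tensor of shape (B, T, V).
@torch.no_grad()
def find_candidate_tokens(policy, reference, input_ids,
                          hard_thresh=4.0, easy_thresh=0.7):
    logp_policy = torch.log_softmax(policy(input_ids).logits, dim=-1)
    logp_ref = torch.log_softmax(reference(input_ids).logits, dim=-1)
    targets = input_ids[:, 1:]
    # per-token NLL under the small policy model ("difficult")
    nll_policy = -logp_policy[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # per-token probability under the stronger reference model ("predictable")
    p_ref = logp_ref[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1).exp()
    # boolean mask over positions 1..T-1: hard for the policy, easy for the reference
    return (nll_policy > hard_thresh) & (p_ref > easy_thresh)
```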
And 15 new RWKV papers in June 🙂 check rwkv.com (108 RWKV papers in total now)
RWKV-8 "Heron" preview (2) - DeepEmbedAttention (DEA), particularly suitable for hybrid models (1/9 KV cache size of MLA). The goal of RWKV-8 is to achieve longctx with 0 KV cache, and I have some progress too🙂
You can add an empty "think" block for RWKV7-G1 to get a higher-quality response while saving tokens.
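A minimal sketch of what this could look like when building the prompt by hand; the exact chat template below (User:/Assistant: with a blank think block) is an assumption, so check the RWKV-LM repo for the canonical G1 format.

```python
# Hedged sketch: insert an empty <think></think> block after "Assistant:"
# so the model skips long chain-of-thought but still answers in its
# reasoning-mode style. Template details here are assumptions, not canonical.
def build_prompt(question: str, empty_think: bool = True) -> str:
    prompt = f"User: {question}\n\nAssistant:"
    if empty_think:
        prompt += " <think>\n</think>"  # model continues straight to the answer
    return prompt

print(build_prompt("What is 17 * 24?"))
```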
RWKV7-G1 "GooseOne" 🪿 2.9B release: pure RNN (attention-free) reasoning model, +5.2T tokens, comparable with Qwen2.5 3B / Llama3.2 3B and fully multilingual. Chat demo & weights on RWKV.com 7B training in progress.
RWKV papers on rwkv.com: 15 new in Apr/May 2025 🔥 DualComp uses RWKV-7 for efficient compression, and RWKVQuant reaches 3.275-bit quantization. RWKV-7 "Goose" 🪿 is 100% RNN and efficiently test-time-trains its state via in-context gradient descent at every token, in parallel.
RWKV papers on rwkv.com: 13 new papers in Mar 2025 🔥 RWKV-7 "Goose" 🪿 is 100% RNN and a meta-in-context learner, efficiently test-time-training its state on the context via in-context gradient descent at every token, in parallel.
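A minimal sketch of what "test-time-training the state via in-context gradient descent" means in the delta-rule view of a matrix-valued recurrent state: each token takes one gradient step on a reconstruction loss over the state. The real RWKV-7 kernel adds per-channel decay, separate removal keys, and a parallel scan across the sequence; this toy loop only illustrates the gradient-descent interpretation.

```python
import numpy as np

# Toy delta-rule view of an "in-context gradient descent" state update
# (illustrative only; not the actual RWKV-7 recurrence).
d = 8
S = np.zeros((d, d))                      # recurrent state: a d x d matrix memory
rng = np.random.default_rng(0)

for t in range(16):
    k = rng.standard_normal(d)            # key for this token
    v = rng.standard_normal(d)            # value to store
    lr = 0.5                              # in-context learning rate

    # One gradient step on the per-token loss  L = 0.5 * ||S k - v||^2.
    # grad_S L = (S k - v) k^T, so the update is the classic delta rule.
    S = S - lr * np.outer(S @ k - v, k)

# Reading the state with a query is just a matrix-vector product.
q = rng.standard_normal(d)
out = S @ q
```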
Try RWKV-8 DeepEmbed if you haven't 🔥 Better than Gemma3n PLE, and easier to use too.
RWKV-8 "Heron" preview (1) - DeepEmbed. Seems Gemma3n is trying similar tricks (Per-Layer Embedding), so I will discuss it first 🪶 It's essentially free performance - lots of params, but can be offloaded to RAM/SSD, and simple to train and deploy🚀