Julien Siems
@julien_siems
PhD student advised by Frank Hutter, working on linear RNNs and state-tracking.
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
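For a concrete picture of the mechanism behind this thread: DeltaProduct builds on the delta-rule update used in the DeltaNet line of work, applying a small number nₕ of generalized Householder (rank-one) micro-steps per token instead of one. The NumPy sketch below is my own minimal illustration of that recurrence, not the paper's implementation; names like `deltaproduct_step`, the `betas` range, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def deltaproduct_step(S, keys, values, betas):
    """One DeltaProduct token update: n_h generalized-Householder
    (delta-rule) micro-steps applied to the matrix state S.
    With n_h == 1 this reduces to a DeltaNet-style update."""
    d = S.shape[1]
    for k, v, beta in zip(keys, values, betas):
        k = k / (np.linalg.norm(k) + 1e-8)          # unit key direction
        # (I - beta k k^T) with beta in [0, 2] is a generalized Householder
        # matrix; adding beta v k^T writes the new value along direction k.
        S = S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S

# Toy usage: d-dimensional keys/values, n_h = 2 Householder steps per token.
rng = np.random.default_rng(0)
d, n_h, T = 8, 2, 16
S = np.zeros((d, d))
for _ in range(T):
    keys = rng.normal(size=(n_h, d))
    values = rng.normal(size=(n_h, d))
    betas = rng.uniform(0.0, 2.0, size=n_h)         # beta near 2 allows reflections
    S = deltaproduct_step(S, keys, values, betas)
print(S.shape)
```

Larger nₕ makes the per-token state transition a product of several Householder matrices, which is what buys the extra expressivity discussed later in the thread.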

We present a new approach to causal inference. Pre-trained on synthetic data, Do-PFN opens the door to a new domain: PFNs for causal inference. We are excited to announce our new paper “Do-PFN: In-Context Learning for Causal Effect Estimation” on arXiv! 🔨🔍 A thread:
📖 (1/n) DeltaProduct's theory got an update! 1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4. 2) For any nₕ, Gated DeltaProduct can recognize any regular language.
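To make "group word problems" concrete: the task is to read a sequence of group elements and emit their running product, and S5 (permutations of five items) is the standard hard case because it is non-solvable. Below is a small, self-contained sketch of the task itself in NumPy, with my own conventions for composition order; it is not code from the paper.

```python
import numpy as np
from itertools import permutations

# S5: all 120 permutations of 5 elements. The word problem asks for the
# running composition of a sequence of group elements -- a canonical
# state-tracking task.
S5 = [np.array(p) for p in permutations(range(5))]

def compose(p, q):
    """Apply permutation q first, then p (composition p ∘ q)."""
    return p[q]

def s5_word_problem(rng, length):
    """Sample a sequence of S5 elements and its running products (labels)."""
    seq = [S5[rng.integers(len(S5))] for _ in range(length)]
    state, labels = np.arange(5), []
    for g in seq:
        state = compose(g, state)
        labels.append(tuple(state))
    return seq, labels

rng = np.random.default_rng(0)
seq, labels = s5_word_problem(rng, length=8)
print(labels[-1])  # the permutation reached after composing all 8 elements
```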
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet (see the sketch below)
- Improved scaling analysis
And more!
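On the effective-rank inspection mentioned above: one common definition is the exponential of the entropy of the normalized singular values (Roy & Vetterli, 2007). I do not know which exact definition the paper uses, so the sketch below just applies that standard choice to a state matrix.

```python
import numpy as np

def effective_rank(S, eps=1e-12):
    """Effective rank via the entropy of normalized singular values
    (Roy & Vetterli, 2007). One common choice; the paper may use another."""
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / (sv.sum() + eps)                    # normalize to a distribution
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))                # exp(H), between 1 and rank(S)

# A rank-2 matrix gives a value near 2; a random dense one is close to d.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
print(effective_rank(low_rank), effective_rank(rng.normal(size=(8, 8))))
```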
What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers? Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to…
📢 (1/16) Introducing PaTH 🛣️, a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks: arxiv.org/abs/2505.16381
RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on RWKV.com. Larger G1 training in progress.
RWKV papers on rwkv.com: 13 new papers in Mar 2025 🔥 RWKV-7 "Goose" 🪿 is 100% RNN and a meta-in-context learner, efficiently test-time-training its state on the context via in-context gradient descent at every token in parallel.
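The "in-context gradient descent" framing can be stated in a few lines: treat the state S as the parameters of a linear key-to-value map and take one SGD step per token on the reconstruction loss ½‖Sk − v‖². The sketch below shows only this basic view; RWKV-7's actual update adds data-dependent decay and gating on top, and the variable names are my own.

```python
import numpy as np

def state_update_as_gradient_step(S, k, v, lr):
    """One token of 'in-context gradient descent' on the state S:
    an SGD step on the reconstruction loss L(S) = 0.5 * ||S k - v||^2,
    i.e. S <- S - lr * (S k - v) k^T, which is exactly the delta rule."""
    grad = np.outer(S @ k - v, k)       # dL/dS = (S k - v) k^T
    return S - lr * grad

rng = np.random.default_rng(0)
d = 8
S = np.zeros((d, d))
k = rng.normal(size=d)
k /= np.linalg.norm(k)                  # unit key so lr = 0.5 is stable
v = rng.normal(size=d)
for _ in range(20):                     # repeated steps drive S k -> v
    S = state_update_as_gradient_step(S, k, v, lr=0.5)
print(np.linalg.norm(S @ k - v))        # tiny residual: the state "learned" k -> v
```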
Test-Time Training (TTT) is now on video! And not just a 5-second video. We can generate a full 1-min video! The TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated…
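A rough sketch of the TTT mechanism as described above: the hidden state is the weight set of a small inner model, each token triggers one gradient step on a self-supervised loss, and the layer output is the inner model's prediction for that token. The toy below uses a two-layer MLP inner model and a denoising loss purely for illustration; the video work's actual inner model, objective, and parallelization are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 16, 32

# The RNN "hidden state" here is the pair (W1, W2): the weights of a tiny
# inner MLP that is trained at test time, one gradient step per token.
W1 = rng.normal(scale=0.1, size=(hidden, d))
W2 = rng.normal(scale=0.1, size=(d, hidden))

def ttt_token_step(W1, W2, x, lr=0.1, noise=0.1):
    xc = x + noise * rng.normal(size=x.shape)   # corrupted view of the token
    h = np.maximum(W1 @ xc, 0.0)                # inner forward pass
    err = W2 @ h - x                            # reconstruction error
    # Manual backprop of 0.5 * ||W2 relu(W1 xc) - x||^2 w.r.t. the inner weights.
    gW2 = np.outer(err, h)
    gh = W2.T @ err
    gW1 = np.outer(gh * (h > 0), xc)
    W1, W2 = W1 - lr * gW1, W2 - lr * gW2       # test-time training step
    y = W2 @ np.maximum(W1 @ x, 0.0)            # layer output for this token
    return W1, W2, y

for _ in range(64):                             # scan over a toy token stream
    x = rng.normal(size=d)
    W1, W2, y = ttt_token_step(W1, W2, x)
```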
Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper 🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA)⚡️, a new kernel algorithm for the mLSTM and other linear attention variants with gating. We find TFLA is really fast! 🧵(1/11)
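For context on what such a kernel parallelizes: linear attention (and gated variants like the mLSTM) admits a chunkwise decomposition, with dense matmuls inside each tile and a small recurrent state carried between tiles. The NumPy reference below shows that decomposition for plain ungated linear attention and checks it against the quadratic formula; TFLA itself additionally handles mLSTM gating and is a fused GPU kernel, which this sketch does not attempt.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    """Ungated causal linear attention o_t = sum_{s<=t} (q_t . k_s) v_s,
    computed chunk by chunk: matmuls inside each tile plus one recurrent
    state carried across tiles. This is only the reference computation."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                  # running sum of k_s v_s^T
    out = np.empty((T, d_v))
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        causal = np.tril(q @ k.T)             # intra-chunk causal scores
        out[start:start+chunk] = causal @ v + q @ S   # intra + inter-chunk parts
        S = S + k.T @ v                       # update state for the next chunk
    return out

# Check against the quadratic-time reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 8)) for _ in range(3))
ref = np.tril(Q @ K.T) @ V
assert np.allclose(chunkwise_linear_attention(Q, K, V), ref)
```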
In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve word problems of dihedral groups, the groups of symmetries of regular polygons, with only two layers. This includes S3 (the symmetries of the equilateral triangle).
📢🔔 I am excited to share the details on our optimized xLSTM architecture for our xLSTM 7B model! 🚨 We optimized the architecture with two goals in mind:
- Efficiency (in Training and Inference)
- Stability
🧵(1/7)