Sukjun (June) Hwang
@sukjun_hwang
ML PhD student @mldcmu advised by @_albertgu
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
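The tweet describes the idea but not the mechanics. As a rough, hypothetical sketch of what "dynamic chunking directly inside the model" could look like, here is a toy module that scores chunk boundaries over raw bytes and pools the bytes between boundaries; the names and design are assumptions for illustration, not the actual H-Net architecture.

```python
# Toy sketch of dynamic chunking over raw bytes: a learned scorer marks likely
# chunk boundaries and byte embeddings are mean-pooled between boundaries.
# Hypothetical illustration only; not the actual H-Net architecture.
import torch
import torch.nn as nn

class ToyDynamicChunker(nn.Module):
    def __init__(self, d_model=256, byte_vocab=256):
        super().__init__()
        self.embed = nn.Embedding(byte_vocab, d_model)   # one embedding per byte value
        self.boundary_scorer = nn.Linear(d_model, 1)     # "is this position a chunk boundary?"

    def forward(self, byte_ids, threshold=0.5):
        x = self.embed(byte_ids)                                      # (batch, length, d_model)
        p_boundary = torch.sigmoid(self.boundary_scorer(x)).squeeze(-1)
        chunked = []
        for b in range(byte_ids.size(0)):
            is_boundary = p_boundary[b] > threshold
            chunks, start = [], 0
            for t in range(byte_ids.size(1)):
                if is_boundary[t] or t == byte_ids.size(1) - 1:
                    chunks.append(x[b, start:t + 1].mean(dim=0))      # pool a variable-length chunk
                    start = t + 1
            chunked.append(torch.stack(chunks))
        return chunked  # list of (num_chunks_i, d_model) tensors; a coarser model would run on these

byte_ids = torch.randint(0, 256, (2, 64))                # two sequences of 64 raw bytes
print([c.shape for c in ToyDynamicChunker()(byte_ids)])
```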
🚨 The era of infinite internet data is ending, so we ask:
👉 What's the right generative modelling objective when data, not compute, is the bottleneck?
TL;DR:
▶️ Compute-constrained? Train autoregressive models.
▶️ Data-constrained? Train diffusion models.
Get ready for 🤿 1/n
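To make "generative modelling objective" concrete, here are the standard forms of the two losses being compared, written from the general literature rather than from the thread itself; the exact variants, noise schedule, and weighting w(s) used in the paper are assumptions here.

```latex
% Autoregressive next-token prediction: maximize the likelihood of each token
% given its prefix.
\[
\mathcal{L}_{\mathrm{AR}}(\theta) \;=\; -\,\mathbb{E}_{x}\Big[\textstyle\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right)\Big]
\]
% Masked (absorbing-state) diffusion: sample a noise level s, mask a
% corresponding fraction of tokens, and reconstruct the masked positions;
% w(s) is the noise-dependent weighting coming from the ELBO.
\[
\mathcal{L}_{\mathrm{MD}}(\theta) \;=\; -\,\mathbb{E}_{x}\,\mathbb{E}_{s}\,\mathbb{E}_{\tilde{x}\sim q_s(\cdot\mid x)}\Big[\,w(s)\!\!\sum_{i:\,\tilde{x}_i=\texttt{[MASK]}}\!\!\log p_\theta\!\left(x_i \mid \tilde{x}\right)\Big]
\]
```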
I'll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! Come support the fight against Big Token 🙏
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you. Let's meet some of our great speakers! 1/
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It's time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training 🧵
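The tweet doesn't spell out the mechanism, so the sketch below is only an illustrative guess at the general idea of isolating memorization architecturally: per-document "sink" units that can be switched off at inference. The routing scheme, layer design, and names are assumptions, not the Memorization Sinks architecture itself.

```python
# Hypothetical toy illustration of isolating memorization into dedicated,
# per-document "sink" units that can be switched off at inference. This is an
# illustrative guess at the general idea, NOT the Memorization Sinks
# architecture from the paper; the routing and layer design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySinkMLP(nn.Module):
    def __init__(self, d_model=512, d_shared=2048, n_sink=1024, sinks_per_doc=8):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_shared), nn.GELU(),
                                    nn.Linear(d_shared, d_model))    # generalizing pathway
        self.sink_in = nn.Linear(d_model, n_sink)                    # extra "sink" capacity
        self.sink_out = nn.Linear(n_sink, d_model)
        self.n_sink, self.k = n_sink, sinks_per_doc

    def sink_mask(self, doc_id: int) -> torch.Tensor:
        # Deterministically hash each document id to a small, fixed subset of sink
        # units, so rote memorization of that document tends to concentrate there.
        g = torch.Generator().manual_seed(doc_id)
        mask = torch.zeros(self.n_sink)
        mask[torch.randperm(self.n_sink, generator=g)[: self.k]] = 1.0
        return mask

    def forward(self, x, doc_id=None):
        out = self.shared(x)                              # always-on shared computation
        if doc_id is not None:                            # training: add this document's sinks
            mask = self.sink_mask(doc_id).to(x.device)
            out = out + self.sink_out(F.gelu(self.sink_in(x)) * mask)
        return out                                        # doc_id=None at inference ablates all sinks

x = torch.randn(4, 16, 512)
layer = ToySinkMLP()
print(layer(x, doc_id=123).shape, layer(x).shape)         # with sinks vs. sinks dropped
```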
🦆🚀 QuACK 🦆🚀: a new speed-of-light (SOL) memory-bound kernel library without a single line of CUDA C++, all straight in Python thanks to CuTe-DSL. On an H100 with 3 TB/s of memory bandwidth, it runs 33-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
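The CuTe-DSL code itself isn't in the tweet, so rather than guess at that API, here is a generic PyTorch timing sketch of the metric behind "SOL" (speed of light): for a memory-bound kernel the ceiling is peak HBM bandwidth (~3.35 TB/s on an H100 SXM, which the tweet rounds to 3 TB/s), and kernels are judged by how close their achieved bandwidth gets to it. This is not QuACK; it only illustrates the measurement.

```python
# Measure achieved memory bandwidth of a memory-bound elementwise op and compare
# it to the H100's peak HBM bandwidth (the "speed of light" for such kernels).
import torch

def achieved_bandwidth_gbs(fn, bytes_moved, iters=50):
    # Time a CUDA callable with events and convert to achieved GB/s.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                      # warm-up (also triggers torch.compile's first-call compilation)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms_per_iter = start.elapsed_time(end) / iters
    return bytes_moved / (ms_per_iter * 1e-3) / 1e9

x = torch.randn(1 << 28, device="cuda", dtype=torch.bfloat16)  # ~0.5 GB of bf16
double = torch.compile(lambda t: t * 2.0)                      # memory-bound elementwise op
bytes_moved = 2 * x.numel() * x.element_size()                 # one read of x + one write of the result
bw = achieved_bandwidth_gbs(lambda: double(x), bytes_moved)
print(f"achieved ~{bw:.0f} GB/s, i.e. {100 * bw / 3350:.0f}% of the ~3.35 TB/s H100 peak")
```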
Albert articulates the tradeoffs between Transformers and SSMs really well. This is why I work on both.
I converted one of my favorite talks I've given over the past year into a blog post: "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit). In a few days, we'll release what I believe is the next major advance for architectures.
I really like this result: an elegant framing and solution that significantly improves length generalization in recurrent models at large (RNNs/SSMs/linear attention/etc.). This has significant implications for the problems architecture researchers should focus on, IMO.
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no need to change the architecture!
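The tweet doesn't say what the fix is, so the sketch below only illustrates the setup (train short, evaluate far longer) plus one hypothetical training-time intervention: starting the recurrence from a carried-over, non-zero state instead of always from zeros. Treat the intervention, model, and hyperparameters as assumptions, not the paper's method.

```python
# Toy illustration of checking length generalization in a recurrent model,
# plus one *hypothetical* training-time intervention: initializing the recurrent
# state from a previous chunk instead of always starting from zeros.
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    def __init__(self, vocab=128, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.cell = nn.GRUCell(d, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens, h0=None):
        h = torch.zeros(tokens.size(0), self.cell.hidden_size) if h0 is None else h0
        logits = []
        for t in range(tokens.size(1)):
            h = self.cell(self.embed(tokens[:, t]), h)
            logits.append(self.head(h))
        return torch.stack(logits, dim=1), h

model = TinyRecurrentLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

train_len, eval_len = 256, 4096          # train short, evaluate much longer
carried_state = None
for step in range(10):                   # toy loop on random tokens; real training is much longer
    tokens = torch.randint(0, 128, (8, train_len))
    # Hypothetical intervention: start from the (detached) final state of the
    # previous chunk, so the model also learns to behave from non-zero states.
    logits, carried_state = model(tokens, h0=carried_state)
    carried_state = carried_state.detach()
    loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 128), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():                    # length-generalization check: 16x the training length
    long_tokens = torch.randint(0, 128, (2, eval_len))
    long_logits, _ = model(long_tokens)
    print(long_logits.shape)
```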