Sukjun (June) Hwang
@sukjun_hwang
ML PhD student @mldcmu advised by @_albertgu
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data.
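The tweet describes the idea but not the mechanics. As a rough, hypothetical sketch of what "dynamic chunking directly inside the model" could look like, here is a toy module that scores chunk boundaries over raw bytes and pools the bytes between boundaries; the names and design are assumptions for illustration, not the actual H-Net architecture.

```python
# Toy sketch of dynamic chunking over raw bytes: a learned scorer marks likely
# chunk boundaries and byte embeddings are mean-pooled between boundaries.
# Hypothetical illustration only; not the actual H-Net architecture.
import torch
import torch.nn as nn

class ToyDynamicChunker(nn.Module):
    def __init__(self, d_model=256, byte_vocab=256):
        super().__init__()
        self.embed = nn.Embedding(byte_vocab, d_model)   # one embedding per byte value
        self.boundary_scorer = nn.Linear(d_model, 1)     # "is this position a chunk boundary?"

    def forward(self, byte_ids, threshold=0.5):
        x = self.embed(byte_ids)                                      # (batch, length, d_model)
        p_boundary = torch.sigmoid(self.boundary_scorer(x)).squeeze(-1)
        chunked = []
        for b in range(byte_ids.size(0)):
            is_boundary = p_boundary[b] > threshold
            chunks, start = [], 0
            for t in range(byte_ids.size(1)):
                if is_boundary[t] or t == byte_ids.size(1) - 1:
                    chunks.append(x[b, start:t + 1].mean(dim=0))      # pool a variable-length chunk
                    start = t + 1
            chunked.append(torch.stack(chunks))
        return chunked  # list of (num_chunks_i, d_model) tensors; a coarser model would run on these

byte_ids = torch.randint(0, 256, (2, 64))                # two sequences of 64 raw bytes
print([c.shape for c in ToyDynamicChunker()(byte_ids)])
```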
🚨 The era of infinite internet data is ending, so we ask:
👉 What's the right generative modelling objective when data, not compute, is the bottleneck?
TL;DR:
▶️ Compute-constrained? Train autoregressive models.
▶️ Data-constrained? Train diffusion models.
Get ready for 🤿 1/n
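To make "generative modelling objective" concrete, here are the standard forms of the two losses being compared, written from the general literature rather than from the thread itself; the exact variants, noise schedule, and weighting w(s) used in the paper are assumptions here.

```latex
% Autoregressive next-token prediction: maximize the likelihood of each token
% given its prefix.
\[
\mathcal{L}_{\mathrm{AR}}(\theta) \;=\; -\,\mathbb{E}_{x}\Big[\textstyle\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right)\Big]
\]
% Masked (absorbing-state) diffusion: sample a noise level s, mask a
% corresponding fraction of tokens, and reconstruct the masked positions;
% w(s) is the noise-dependent weighting coming from the ELBO.
\[
\mathcal{L}_{\mathrm{MD}}(\theta) \;=\; -\,\mathbb{E}_{x}\,\mathbb{E}_{s}\,\mathbb{E}_{\tilde{x}\sim q_s(\cdot\mid x)}\Big[\,w(s)\!\!\sum_{i:\,\tilde{x}_i=\texttt{[MASK]}}\!\!\log p_\theta\!\left(x_i \mid \tilde{x}\right)\Big]
\]
```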
I'll be giving the first H-Net talk this afternoon at 4:30-5 PT at the ES-FoMo workshop! Come support the fight against Big Token 🙏
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you. Let's meet some of our great speakers! 1/
1/ So much of privacy research is designing post-hoc methods to make models memorization-free. It's time we turn that around with architectural changes. Excited to add Memorization Sinks to the transformer architecture this #ICML2025 to isolate memorization during LLM training 🧵
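The tweet doesn't spell out the mechanism, so the sketch below is only an illustrative guess at the general idea of isolating memorization architecturally: per-document "sink" units that can be switched off at inference. The routing scheme, layer design, and names are assumptions, not the Memorization Sinks architecture itself.

```python
# Hypothetical toy illustration of isolating memorization into dedicated,
# per-document "sink" units that can be switched off at inference. This is an
# illustrative guess at the general idea, NOT the Memorization Sinks
# architecture from the paper; the routing and layer design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySinkMLP(nn.Module):
    def __init__(self, d_model=512, d_shared=2048, n_sink=1024, sinks_per_doc=8):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_shared), nn.GELU(),
                                    nn.Linear(d_shared, d_model))    # generalizing pathway
        self.sink_in = nn.Linear(d_model, n_sink)                    # extra "sink" capacity
        self.sink_out = nn.Linear(n_sink, d_model)
        self.n_sink, self.k = n_sink, sinks_per_doc

    def sink_mask(self, doc_id: int) -> torch.Tensor:
        # Deterministically hash each document id to a small, fixed subset of sink
        # units, so rote memorization of that document tends to concentrate there.
        g = torch.Generator().manual_seed(doc_id)
        mask = torch.zeros(self.n_sink)
        mask[torch.randperm(self.n_sink, generator=g)[: self.k]] = 1.0
        return mask

    def forward(self, x, doc_id=None):
        out = self.shared(x)                              # always-on shared computation
        if doc_id is not None:                            # training: add this document's sinks
            mask = self.sink_mask(doc_id).to(x.device)
            out = out + self.sink_out(F.gelu(self.sink_in(x)) * mask)
        return out                                        # doc_id=None at inference ablates all sinks

x = torch.randn(4, 16, 512)
layer = ToySinkMLP()
print(layer(x, doc_id=123).shape, layer(x).shape)         # with sinks vs. sinks dropped
```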
🦆🚀 QuACK 🦆🚀: a new speed-of-light (SOL) memory-bound kernel library without a single line of CUDA C++, all straight in Python thanks to CuTe-DSL. On an H100 with 3 TB/s of memory bandwidth, it runs 33-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
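The CuTe-DSL code itself isn't in the tweet, so rather than guess at that API, here is a generic PyTorch timing sketch of the metric behind "SOL" (speed of light): for a memory-bound kernel the ceiling is peak HBM bandwidth (~3.35 TB/s on an H100 SXM, which the tweet rounds to 3 TB/s), and kernels are judged by how close their achieved bandwidth gets to it. This is not QuACK; it only illustrates the measurement.

```python
# Measure achieved memory bandwidth of a memory-bound elementwise op and compare
# it to the H100's peak HBM bandwidth (the "speed of light" for such kernels).
import torch

def achieved_bandwidth_gbs(fn, bytes_moved, iters=50):
    # Time a CUDA callable with events and convert to achieved GB/s.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                      # warm-up (also triggers torch.compile's first-call compilation)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms_per_iter = start.elapsed_time(end) / iters
    return bytes_moved / (ms_per_iter * 1e-3) / 1e9

x = torch.randn(1 << 28, device="cuda", dtype=torch.bfloat16)  # ~0.5 GB of bf16
double = torch.compile(lambda t: t * 2.0)                      # memory-bound elementwise op
bytes_moved = 2 * x.numel() * x.element_size()                 # one read of x + one write of the result
bw = achieved_bandwidth_gbs(lambda: double(x), bytes_moved)
print(f"achieved ~{bw:.0f} GB/s, i.e. {100 * bw / 3350:.0f}% of the ~3.35 TB/s H100 peak")
```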
Albert articulates the tradeoffs between Transformers and SSMs really well. This is why I work on both.
I converted one of my favorite talks I've given over the past year into a blog post: "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit). In a few days, we'll release what I believe is the next major advance for architectures.
I really like this result: an elegant framing and solution that significantly improves length generalization in recurrent models at large (RNNs/SSMs/linear attention/etc.). This has significant implications for the problems architecture researchers should focus on, IMO.
Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the training length. We show a simple and general fix that enables length generalization on sequences of up to 256k tokens, with no need to change the architecture!
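The tweet doesn't say what the fix is, so the sketch below only illustrates the setup (train short, evaluate far longer) plus one hypothetical training-time intervention: starting the recurrence from a carried-over, non-zero state instead of always from zeros. Treat the intervention, model, and hyperparameters as assumptions, not the paper's method.

```python
# Toy illustration of checking length generalization in a recurrent model,
# plus one *hypothetical* training-time intervention: initializing the recurrent
# state from a previous chunk instead of always starting from zeros.
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    def __init__(self, vocab=128, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.cell = nn.GRUCell(d, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens, h0=None):
        h = torch.zeros(tokens.size(0), self.cell.hidden_size) if h0 is None else h0
        logits = []
        for t in range(tokens.size(1)):
            h = self.cell(self.embed(tokens[:, t]), h)
            logits.append(self.head(h))
        return torch.stack(logits, dim=1), h

model = TinyRecurrentLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

train_len, eval_len = 256, 4096          # train short, evaluate much longer
carried_state = None
for step in range(10):                   # toy loop on random tokens; real training is much longer
    tokens = torch.randint(0, 128, (8, train_len))
    # Hypothetical intervention: start from the (detached) final state of the
    # previous chunk, so the model also learns to behave from non-zero states.
    logits, carried_state = model(tokens, h0=carried_state)
    carried_state = carried_state.detach()
    loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 128), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():                    # length-generalization check: 16x the training length
    long_tokens = torch.randint(0, 128, (2, eval_len))
    long_logits, _ = model(long_tokens)
    print(long_logits.shape)
```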