Songlin Yang
@SonglinYang4
Ph.D. student @MIT_CSAIL. Working on scalable and principled methods in #ML & #LLM. In open-sourcing I trust 🐳. she/her/hers
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites (among others) are:
- Transformer-PSM by @MorrisYau et al., and
- Radial Attention by Xingyang and @lmxyy1999 et…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?
Introducing Log-Linear Attention, with:
- Log-linear time training
- Logarithmic inference cost (in both time and memory)
- Hardware-efficient Triton kernels
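For intuition, here is a rough, unofficial sketch of the hierarchical idea as I read the announcement: each position decomposes its prefix into O(log t) Fenwick-tree segments, keeps one linear-attention state per segment, and mixes the segment readouts with per-level weights. The naive loop, the softmax mixing, and the `level_bias` parameter are illustrative assumptions, not the paper's algorithm or its Triton kernels.

```python
# Toy reference (my own reading, not the paper's algorithm or kernels): each position t
# decomposes its prefix into O(log t) Fenwick-tree segments, keeps one linear-attention
# state (sum of k v^T) per segment, and mixes segment readouts with softmax weights over
# segment levels. Naive O(T^2) loop, for intuition only.
import torch

def log_linear_attn_reference(q, k, v, level_bias):
    # q, k, v: (T, d); level_bias: (L,) per-level mixing logits (assumed learnable)
    T, d = q.shape
    out = torch.zeros_like(v)
    for t in range(1, T):
        # Fenwick decomposition of the prefix [0, t) into power-of-two segments
        segs, hi = [], t
        while hi > 0:
            lo = hi - (hi & -hi)            # strip the lowest set bit -> segment [lo, hi)
            segs.append((lo, hi))
            hi = lo
        states = torch.stack([k[a:b].T @ v[a:b] for a, b in segs])         # (S, d, d)
        levels = torch.tensor([(b - a).bit_length() - 1 for a, b in segs])
        w = torch.softmax(level_bias[levels], dim=0)                        # (S,)
        out[t] = torch.einsum('s,sij,i->j', w, states, q[t])
    return out

q = k = v = torch.randn(16, 8)
print(log_linear_attn_reference(q, k, v, torch.zeros(8)).shape)   # torch.Size([16, 8])
```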
Hierarchical layout is super elegant. Feels like the right abstraction for high-performance GPU kernels. FlashAttention 2 actually started because we wanted to rewrite FA1 in CuTe
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blog posts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are still a few more parts to come; hope you enjoy!
Check this out!
We’re open-sourcing the pre-training code for Phi4-mini-Flash, our SoTA hybrid model that delivers 10× faster reasoning than Transformers — along with μP++, a suite of simple yet powerful scaling laws for stable large-scale training. 🔗 github.com/microsoft/Arch… (1/4)
Join us tomorrow!
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you. Let's meet some of our great speakers! 1/
Resources:
Paper: arxiv.org/pdf/2507.06457
Checkpoints: huggingface.co/collections/m-…
Model and training code for LaCT are released for language modeling, AR video generation, and novel view synthesis, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are out too 🤓. Come play!
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch…
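A minimal, unofficial sketch of the large-chunk test-time-training idea as described above: a small MLP serves as a nonlinear memory whose weights are updated with a few gradient steps per large chunk, then queried for the next chunk. The chunk size, reconstruction loss, and plain SGD update are assumptions, not LaCT's actual recipe.

```python
# Unofficial sketch: a small MLP acts as a nonlinear memory whose weights are updated
# with a few gradient steps per (large) chunk, then queried. Chunk size, loss, and the
# plain SGD update are assumptions, not LaCT's recipe.
import torch
import torch.nn as nn

class TTTMemory(nn.Module):
    def __init__(self, d, hidden=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))

    def update(self, k, v, lr=0.1, steps=1):
        # "Write": a few gradient steps so that f(k) ≈ v on the current chunk.
        opt = torch.optim.SGD(self.f.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((self.f(k) - v) ** 2).mean()
            loss.backward()
            opt.step()

    def read(self, q):
        # "Read": apply the current fast weights to the queries.
        with torch.no_grad():
            return self.f(q)

d, chunk = 64, 2048                  # large chunks amortize the cost of each update
mem = TTTMemory(d)
x = torch.randn(4 * chunk, d)
outs = []
for s in range(0, x.shape[0], chunk):
    q = k = v = x[s:s + chunk]
    outs.append(mem.read(q))         # read with memory written from earlier chunks
    mem.update(k, v)                 # then write the current chunk into the memory
print(torch.cat(outs).shape)         # torch.Size([8192, 64])
```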
@Kimi_Moonshot K2 on @GroqInc, over 300 TPS. I'm flying! > Once released, the world is optimizing it for you. Only open source can do this.
H-Nets are the future.
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
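A rough reading of those two components in code (the shapes, the 0.5 threshold, and the plain EMA are illustrative assumptions, not H-Net's implementation): a routing score derived from the similarity of adjacent hidden states, and an exponential moving average over the fine-grained stream for smoothing.

```python
# Illustrative sketch of the two components described above, not H-Net's code:
# 1) a routing score from the similarity of adjacent hidden states, where low similarity
#    suggests a chunk boundary, and
# 2) an exponential moving average over the fine-grained stream for smoothing.
import torch
import torch.nn.functional as F

def boundary_probs(h):
    # h: (T, d) hidden states; boundary probability is high where neighbors disagree
    sim = F.cosine_similarity(h[1:], h[:-1], dim=-1)     # (T-1,)
    p = 0.5 * (1.0 - sim)                                # map similarity to [0, 1]
    return torch.cat([torch.ones(1), p])                 # always place a boundary at t=0

def ema_smooth(h, decay=0.9):
    # simple EMA along the sequence dimension
    out, state = [], torch.zeros(h.shape[-1])
    for t in range(h.shape[0]):
        state = decay * state + (1.0 - decay) * h[t]
        out.append(state)
    return torch.stack(out)

h = torch.randn(32, 16)
starts = (boundary_probs(h) > 0.5).nonzero().flatten()   # selected chunk start indices
print(starts, ema_smooth(h).shape)
```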
Impressive results on super long-form speech generation (>10 minutes)! Glad to see that the intuitions here closely track what I wrote in my blog post on SSMs vs Transformers x.com/_albertgu/stat… 1. SSMs make more sense for long context where coherence matters more…
Excited to share Long-Form Speech Generation with Spoken LMs at #ICML2025 (Wed. oral)! We’ll present: - LibriSpeech-Long: new benchmark and evals for long-form generation quality - SpeechSSM: 1st *textless* spoken LMs for expressive *unbounded* speech Listen and learn more: 🧵
Synthetic tasks like associative recall and MQAR are a great guide for building models. Excited to see this work from @nick11roberts on creating new LMs!
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
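For readers unfamiliar with the MQAR task mentioned above, here is a toy generator for a multi-query associative recall-style synthetic, sketched from the general task description; the vocabulary layout and sizes are arbitrary choices, not the benchmark's exact format.

```python
# Toy MQAR-style generator: the prompt lists key-value pairs, then queries some of the
# keys, and the model must emit the matching values. Layout and sizes are arbitrary.
import random

def make_mqar_example(n_pairs=8, n_queries=4, n_keys=64, n_values=64, seed=0):
    rng = random.Random(seed)
    keys = rng.sample(range(n_keys), n_pairs)
    values = [rng.randrange(n_values) for _ in keys]
    kv = dict(zip(keys, values))
    prompt = [tok for k, v in kv.items() for tok in (f"k{k}", f"v{v}")]
    queried = rng.sample(keys, n_queries)
    prompt += [f"k{k}" for k in queried]             # queries reuse the key tokens
    targets = [f"v{kv[k]}" for k in queried]         # expected outputs, in order
    return prompt, targets

prompt, targets = make_mqar_example()
print(prompt)
print(targets)
```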
I played with it for an hour. Went through my usual prompts (math derivations, floating-point optimizations, …). It's a good model; it feels comparable to the best frontier models
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence…
I'll be attending ICML until July 20th. Happy to chat—feel free to DM!
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
I really like Phil Tillet's framing of different tools having different tradeoffs in productivity and performance: torch.compile, Triton, CUDA, PTX. It's still early, but CuTe-DSL and similar Python-based DSLs might bend this curve. And soon we can probably get LLMs to generate…
Getting mem-bound kernels to speed-of-light isn't a dark art; it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can use directly. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
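As a back-of-the-envelope illustration of the speed-of-light framing above: a memory-bound kernel can at best take (bytes moved) / (peak DRAM bandwidth). The snippet below times a typical memory-bound PyTorch op against that bound; the ~3 TB/s figure comes from the tweet, the GELU op is an arbitrary stand-in (this is not QuACK code), and it needs a CUDA GPU.

```python
# Back-of-the-envelope "speed of light" check: a memory-bound kernel can at best take
# (bytes moved) / (peak DRAM bandwidth). ~3 TB/s is the H100 figure quoted above; the
# GELU op is an arbitrary stand-in, not QuACK code. Requires a CUDA GPU.
import time
import torch

def speed_of_light_ms(n_bytes, peak_gb_per_s=3000):      # ~3 TB/s quoted for H100
    return n_bytes / (peak_gb_per_s * 1e9) * 1e3

x = torch.randn(1 << 26, device="cuda")
n_bytes = x.numel() * x.element_size() * 2               # read x once, write y once

torch.nn.functional.gelu(x)                              # warm-up
torch.cuda.synchronize(); t0 = time.time()
y = torch.nn.functional.gelu(x)                          # a typical mem-bound op
torch.cuda.synchronize(); t1 = time.time()

print(f"achieved {(t1 - t0) * 1e3:.3f} ms vs speed-of-light {speed_of_light_ms(n_bytes):.3f} ms")
```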
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 MODELS (340M & 1.3B) to dissect what truly makes a hybrid model tick 🧶
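To make the design space concrete, here is a hedged sketch of what "mixing linear and full attention" means structurally: interleave the two block types at some ratio. The `full_every=4` ratio and the plain ReLU linear-attention block are placeholders, not a claim about which variant the study finds best; norms, MLPs, and causal masks for the full-attention layers are omitted for brevity.

```python
# Hedged sketch of a hybrid layout: interleave linear-attention and full-attention blocks
# at some ratio. The ratio and the ReLU linear-attention block are placeholders; norms,
# MLPs, and causal masks for the full-attention layers are omitted.
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, x):                                  # x: (B, T, d)
        q, k = torch.relu(self.q(x)), torch.relu(self.k(x))
        kv = torch.einsum('bti,btj->btij', k, self.v(x)).cumsum(dim=1)  # running k v^T state
        return torch.einsum('bti,btij->btj', q, kv)        # causal linear attention (unnormalized)

def make_hybrid(d, n_layers, full_every=4):
    # one full-attention block for every (full_every - 1) linear-attention blocks
    return [nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            if (i + 1) % full_every == 0 else LinearAttention(d)
            for i in range(n_layers)]

x = torch.randn(2, 128, 64)
for layer in make_hybrid(64, 8):
    if isinstance(layer, nn.MultiheadAttention):
        x = layer(x, x, x, need_weights=False)[0]
    else:
        x = layer(x)
print(x.shape)                                             # torch.Size([2, 128, 64])
```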
I've been asked many times lately which repo students should use when working on test-time scaling with slightly modified attention or generation workflows (customized reward models / search). HF is a bit too time-consuming, especially with tons of token generation, and SGLang/vLLM is a bit hard…
🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale. If you are: ✅ Looking for an inference framework that's easy to extend. 🐢…