Songlin Yang
@SonglinYang4
Ph.D. student @MIT_CSAIL. Working on scalable and principled methods in #ML & #LLM. In open-sourcing I trust 🐳. she/her/hers
Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites (among others) are:
- Transformer-PSM by @MorrisYau et al., and
- Radial Attention by Xingyang and @lmxyy1999 et…
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?
Introducing Log-Linear Attention, with:
- Log-linear time training
- Logarithmic inference cost (in both time and memory)
- Hardware-efficient Triton kernels
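For intuition, here is a rough, unofficial sketch of the hierarchical idea as I read the announcement: each position decomposes its prefix into O(log t) Fenwick-tree segments, keeps one linear-attention state per segment, and mixes the segment readouts with per-level weights. The naive loop, the softmax mixing, and the `level_bias` parameter are illustrative assumptions, not the paper's algorithm or its Triton kernels.

```python
# Toy reference (my own reading, not the paper's algorithm or kernels): each position t
# decomposes its prefix into O(log t) Fenwick-tree segments, keeps one linear-attention
# state (sum of k v^T) per segment, and mixes segment readouts with softmax weights over
# segment levels. Naive O(T^2) loop, for intuition only.
import torch

def log_linear_attn_reference(q, k, v, level_bias):
    # q, k, v: (T, d); level_bias: (L,) per-level mixing logits (assumed learnable)
    T, d = q.shape
    out = torch.zeros_like(v)
    for t in range(1, T):
        # Fenwick decomposition of the prefix [0, t) into power-of-two segments
        segs, hi = [], t
        while hi > 0:
            lo = hi - (hi & -hi)            # strip the lowest set bit -> segment [lo, hi)
            segs.append((lo, hi))
            hi = lo
        states = torch.stack([k[a:b].T @ v[a:b] for a, b in segs])         # (S, d, d)
        levels = torch.tensor([(b - a).bit_length() - 1 for a, b in segs])
        w = torch.softmax(level_bias[levels], dim=0)                        # (S,)
        out[t] = torch.einsum('s,sij,i->j', w, states, q[t])
    return out

q = k = v = torch.randn(16, 8)
print(log_linear_attn_reference(q, k, v, torch.zeros(8)).shape)   # torch.Size([16, 8])
```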
Hierarchical layout is super elegant. Feels like the right abstraction for high-performance GPU kernels. FlashAttention 2 actually started because we wanted to rewrite FA1 in CuTe
developer.nvidia.com/blog/cutlass-p… marks the start of a short series of blog posts about CUTLASS 3.x and CuTe that we've been meaning to write for years. There are still a few more parts to come; hope you enjoy!
Check this out!
We’re open-sourcing the pre-training code for Phi4-mini-Flash, our SoTA hybrid model that delivers 10× faster reasoning than Transformers — along with μP++, a suite of simple yet powerful scaling laws for stable large-scale training. 🔗 github.com/microsoft/Arch… (1/4)
Join us tomorrow!
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you. Let's meet some of our great speakers! 1/
Resources:
Paper: arxiv.org/pdf/2507.06457
Checkpoints: huggingface.co/collections/m-…
Model and training code for LaCT are released for language modeling, AR video generation, and novel view synthesis, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are out too 🤓. Come play!
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch…
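A minimal, unofficial sketch of the large-chunk test-time-training idea as described above: a small MLP serves as a nonlinear memory whose weights are updated with a few gradient steps per large chunk, then queried for the next chunk. The chunk size, reconstruction loss, and plain SGD update are assumptions, not LaCT's actual recipe.

```python
# Unofficial sketch: a small MLP acts as a nonlinear memory whose weights are updated
# with a few gradient steps per (large) chunk, then queried. Chunk size, loss, and the
# plain SGD update are assumptions, not LaCT's recipe.
import torch
import torch.nn as nn

class TTTMemory(nn.Module):
    def __init__(self, d, hidden=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))

    def update(self, k, v, lr=0.1, steps=1):
        # "Write": a few gradient steps so that f(k) ≈ v on the current chunk.
        opt = torch.optim.SGD(self.f.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((self.f(k) - v) ** 2).mean()
            loss.backward()
            opt.step()

    def read(self, q):
        # "Read": apply the current fast weights to the queries.
        with torch.no_grad():
            return self.f(q)

d, chunk = 64, 2048                  # large chunks amortize the cost of each update
mem = TTTMemory(d)
x = torch.randn(4 * chunk, d)
outs = []
for s in range(0, x.shape[0], chunk):
    q = k = v = x[s:s + chunk]
    outs.append(mem.read(q))         # read with memory written from earlier chunks
    mem.update(k, v)                 # then write the current chunk into the memory
print(torch.cat(outs).shape)         # torch.Size([8192, 64])
```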
@Kimi_Moonshot K2 on @GroqInc, over 300 TPS. I'm flying! > Once released, the world is optimizing it for you. Only open source can do this.
H-Nets are the future.
H-Net introduces several technical components, including a similarity-score routing module and EMA-based smoothing module, to allow learning discrete chunk boundaries stably. And because it’s fully end-to-end, H-Net can be *recursively iterated* to more stages of hierarchy! 3/
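A rough reading of those two components in code (the shapes, the 0.5 threshold, and the plain EMA are illustrative assumptions, not H-Net's implementation): a routing score derived from the similarity of adjacent hidden states, and an exponential moving average over the fine-grained stream for smoothing.

```python
# Illustrative sketch of the two components described above, not H-Net's code:
# 1) a routing score from the similarity of adjacent hidden states, where low similarity
#    suggests a chunk boundary, and
# 2) an exponential moving average over the fine-grained stream for smoothing.
import torch
import torch.nn.functional as F

def boundary_probs(h):
    # h: (T, d) hidden states; boundary probability is high where neighbors disagree
    sim = F.cosine_similarity(h[1:], h[:-1], dim=-1)     # (T-1,)
    p = 0.5 * (1.0 - sim)                                # map similarity to [0, 1]
    return torch.cat([torch.ones(1), p])                 # always place a boundary at t=0

def ema_smooth(h, decay=0.9):
    # simple EMA along the sequence dimension
    out, state = [], torch.zeros(h.shape[-1])
    for t in range(h.shape[0]):
        state = decay * state + (1.0 - decay) * h[t]
        out.append(state)
    return torch.stack(out)

h = torch.randn(32, 16)
starts = (boundary_probs(h) > 0.5).nonzero().flatten()   # selected chunk start indices
print(starts, ema_smooth(h).shape)
```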
Impressive results on super long-form speech generation (>10 minutes)! Glad to see that the intuitions here closely track what I wrote in my blog post on SSMs vs Transformers x.com/_albertgu/stat… 1. SSMs make more sense for long context where coherence matters more…
Excited to share Long-Form Speech Generation with Spoken LMs at #ICML2025 (Wed. oral)! We’ll present: - LibriSpeech-Long: new benchmark and evals for long-form generation quality - SpeechSSM: 1st *textless* spoken LMs for expressive *unbounded* speech Listen and learn more: 🧵
Synthetic tasks like associative recall and MQAR are a great guide for building models. Excited to see this work from @nick11roberts on creating new LMs!
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
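For readers unfamiliar with the MQAR task mentioned above, here is a toy generator for a multi-query associative recall-style synthetic, sketched from the general task description; the vocabulary layout and sizes are arbitrary choices, not the benchmark's exact format.

```python
# Toy MQAR-style generator: the prompt lists key-value pairs, then queries some of the
# keys, and the model must emit the matching values. Layout and sizes are arbitrary.
import random

def make_mqar_example(n_pairs=8, n_queries=4, n_keys=64, n_values=64, seed=0):
    rng = random.Random(seed)
    keys = rng.sample(range(n_keys), n_pairs)
    values = [rng.randrange(n_values) for _ in keys]
    kv = dict(zip(keys, values))
    prompt = [tok for k, v in kv.items() for tok in (f"k{k}", f"v{v}")]
    queried = rng.sample(keys, n_queries)
    prompt += [f"k{k}" for k in queried]             # queries reuse the key tokens
    targets = [f"v{kv[k]}" for k in queried]         # expected outputs, in order
    return prompt, targets

prompt, targets = make_mqar_example()
print(prompt)
print(targets)
```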
I played with it for an hour. Went through my usual prompts (math derivations, floating-point optimizations, …). It's a good model; it feels comparable to the best frontier models
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence…
I'll be attending ICML until July 20th. Happy to chat—feel free to DM!
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
I really like Phil Tillet's framing of different tools having different tradeoffs in productivity and performance: torch.compile, Triton, CUDA, PTX. It's still early, but CuTe-DSL and similar Python-based DSLs might bend this curve. And soon we can probably get LLMs to generate…
Getting mem-bound kernels to speed-of-light isn't a dark art; it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can use directly. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.
🦆🚀QuACK🦆🚀: new SOL mem-bound kernel library without a single line of CUDA C++ all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With @tedzadouri and @tri_dao
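As a back-of-the-envelope illustration of the speed-of-light framing above: a memory-bound kernel can at best take (bytes moved) / (peak DRAM bandwidth). The snippet below times a typical memory-bound PyTorch op against that bound; the ~3 TB/s figure comes from the tweet, the GELU op is an arbitrary stand-in (this is not QuACK code), and it needs a CUDA GPU.

```python
# Back-of-the-envelope "speed of light" check: a memory-bound kernel can at best take
# (bytes moved) / (peak DRAM bandwidth). ~3 TB/s is the H100 figure quoted above; the
# GELU op is an arbitrary stand-in, not QuACK code. Requires a CUDA GPU.
import time
import torch

def speed_of_light_ms(n_bytes, peak_gb_per_s=3000):      # ~3 TB/s quoted for H100
    return n_bytes / (peak_gb_per_s * 1e9) * 1e3

x = torch.randn(1 << 26, device="cuda")
n_bytes = x.numel() * x.element_size() * 2               # read x once, write y once

torch.nn.functional.gelu(x)                              # warm-up
torch.cuda.synchronize(); t0 = time.time()
y = torch.nn.functional.gelu(x)                          # a typical mem-bound op
torch.cuda.synchronize(); t1 = time.time()

print(f"achieved {(t1 - t0) * 1e3:.3f} ms vs speed-of-light {speed_of_light_ms(n_bytes):.3f} ms")
```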
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 MODELS (340M & 1.3B) to dissect what truly makes a hybrid model tick 🧶
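To make the design space concrete, here is a hedged sketch of what "mixing linear and full attention" means structurally: interleave the two block types at some ratio. The `full_every=4` ratio and the plain ReLU linear-attention block are placeholders, not a claim about which variant the study finds best; norms, MLPs, and causal masks for the full-attention layers are omitted for brevity.

```python
# Hedged sketch of a hybrid layout: interleave linear-attention and full-attention blocks
# at some ratio. The ratio and the ReLU linear-attention block are placeholders; norms,
# MLPs, and causal masks for the full-attention layers are omitted.
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, x):                                  # x: (B, T, d)
        q, k = torch.relu(self.q(x)), torch.relu(self.k(x))
        kv = torch.einsum('bti,btj->btij', k, self.v(x)).cumsum(dim=1)  # running k v^T state
        return torch.einsum('bti,btij->btj', q, kv)        # causal linear attention (unnormalized)

def make_hybrid(d, n_layers, full_every=4):
    # one full-attention block for every (full_every - 1) linear-attention blocks
    return [nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            if (i + 1) % full_every == 0 else LinearAttention(d)
            for i in range(n_layers)]

x = torch.randn(2, 128, 64)
for layer in make_hybrid(64, 8):
    if isinstance(layer, nn.MultiheadAttention):
        x = layer(x, x, x, need_weights=False)[0]
    else:
        x = layer(x)
print(x.shape)                                             # torch.Size([2, 128, 64])
```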
I've been asked many times lately which repo students should use when working on test-time scaling with slightly modified attention or generation workflows (customized reward models / search). HF is a bit too time-consuming, especially with tons of token generation, and SGLang/vLLM is a bit hard…
🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale. If you are: ✅ Looking for an inference framework that's easy to extend. 🐢…