Sulin Liu
@su_lin_liu
Postdoc @MIT Ex: Machine Learning PhD @Princeton @Meta @NTUsg @NUSingapore
Discrete generative models use denoisers for generation, but they can slip up. What if generation *isn’t only* about denoising?🤔 Introducing DDPD: Discrete Diffusion with Planned Denoising🤗🧵(1/11) w/ @junonam_ @AndrewC_ML @HannesStaerk @xuyilun2 Tommi Jaakkola @RGBLabMIT
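The core idea, as I read the thread: a separate planner decides which positions still look noisy, and the denoiser only has to fix those. Below is a minimal sketch of that planner-then-denoise loop; `planner` and `denoiser` are hypothetical stand-ins rather than the released DDPD models, and the greedy argmax selection is a simplification of whatever rule the paper actually uses.

```python
import torch

def ddpd_style_sample(planner, denoiser, x, num_steps):
    """Illustrative planner-then-denoise loop (not the official DDPD sampler).

    planner(x)  -> (B, L) scores for how likely each position is still noisy.
    denoiser(x) -> (B, L, V) logits over clean tokens at every position.
    Mutates x in place and returns it.
    """
    b_idx = torch.arange(x.size(0), device=x.device)
    for _ in range(num_steps):
        noisy_scores = planner(x)              # which positions still need work?
        pos = noisy_scores.argmax(dim=-1)      # greedily pick the most suspect one
        probs = denoiser(x)[b_idx, pos].softmax(dim=-1)
        x[b_idx, pos] = torch.multinomial(probs, 1).squeeze(-1)  # resample that token
    return x
```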

🚀 Meet EvaByte: The best open-source tokenizer-free language model! Our 6.5B byte LM matches modern tokenizer-based LMs with 5x less data & 2x faster decoding, naturally extending to multimodal tasks while fixing tokenization quirks. 💻 Blog: bit.ly/3CjEmTC 🧵 1/9
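"Tokenizer-free" here means the model consumes raw UTF-8 bytes, so the input pipeline reduces to something like the sketch below; the optional offset for special tokens is my own illustrative assumption, not EvaByte's actual vocabulary layout.

```python
def bytes_to_ids(text: str, num_special: int = 0) -> list[int]:
    """Map text to byte ids (0-255), optionally shifted past special tokens."""
    return [b + num_special for b in text.encode("utf-8")]

def ids_to_text(ids: list[int], num_special: int = 0) -> str:
    """Inverse mapping; replaces any bytes that don't decode cleanly."""
    return bytes(i - num_special for i in ids).decode("utf-8", errors="replace")

ids = bytes_to_ids("tokenizer-free 🙂")
assert ids_to_text(ids) == "tokenizer-free 🙂"
```

Because every character is just its bytes, there are no out-of-vocabulary tokens and no tokenization quirks to special-case.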
Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵 1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.
which train are you on?🚄🚇🚆 (also me: we need faster trains in the states 😶)

LLaDA with muP. it just works, again. I'm so tired of saying it works. Just use it, and thank me later
i think that all the "pre-training is dead" takes are bad. the issue with these big big models is that they are capped by dogwater human-labeled post-training data. we shall continue to scale by exploiting RL with verifiable rewards. excited to see gpt-4.5 be used as a base for the next o model.
LLMs have complex joint beliefs about all sorts of quantities. And my postdoc @jamesrequeima visualized them! In this thread we show LLM predictive distributions conditioned on data and free-form text. LLMs pick up on all kinds of subtle and unusual structure: 🧵
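A rough sketch of how such a predictive distribution can be elicited in practice: condition a prompt on the observed data plus free-form text, sample many completions at nonzero temperature, and histogram the parsed numbers. `sample_completion` is a hypothetical LLM call, and this is not the visualization code from the thread.

```python
import re
from collections import Counter

def llm_predictive_histogram(sample_completion, observed, context, n_samples=200):
    """Empirical predictive distribution for the next value in a series.

    sample_completion(prompt) -> str is a hypothetical LLM call (temperature > 0).
    observed: numbers seen so far; context: free-form text conditioning.
    """
    prompt = (
        f"{context}\n"
        f"Observed values: {', '.join(str(v) for v in observed)}\n"
        "Next value:"
    )
    draws = []
    for _ in range(n_samples):
        reply = sample_completion(prompt)
        m = re.search(r"-?\d+(?:\.\d+)?", reply)   # take the first number in the reply
        if m:
            draws.append(float(m.group()))
    return Counter(round(v, 1) for v in draws)     # coarse histogram over the samples
```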
Excited to share that I’ve been working on scaling up diffusion language models at Inception. A new generation of LLMs with unprecedented capabilities is coming!
We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.
grok also tends to do more solution verification at the end than chatgpt. Clearly this cannot be baked in through just RL from verifiable rewards...
Discrete diffusion (including masked language models) deserves more investment in terms of research and compute, especially when we are running out of pre-training data for autoregressive LLMs. You can get a lot more data for free by just masking data or perturbing it with…
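The "free data" point: every clean sequence yields arbitrarily many training examples, because you can re-corrupt it with a fresh random mask each time. A minimal sketch of that corruption step, assuming a dedicated mask token id and a uniform mask-ratio schedule (both illustrative choices):

```python
import torch

MASK_ID = 0  # illustrative mask token id

def mask_corrupt(tokens: torch.Tensor):
    """Corrupt a batch of sequences by hiding a random fraction of positions.

    Each call produces a new training example from the same clean data:
    the model is trained to recover `tokens` at the masked positions.
    """
    b, l = tokens.shape
    ratio = torch.rand(b, 1)                       # different mask ratio per sequence
    mask = torch.rand(b, l) < ratio                # which positions to hide
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    return corrupted, mask                         # mask marks the prediction targets
```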
This is really insane. They bet it all and scaled up a discrete diffusion model to llama-7B scale. IIRC nobody dared to try this at this scale, but these madlads did it. They even fine-tuned it to be a dialogue model. This is really frontier-level shit that is genuinely new…
I can’t begin to imagine how strong Anthropic’s internal models must be, since Claude was by far the strongest of the standard non-reasoning models: it’s the only one that could escape getting stuck in loops, a recurring problem that no other LLM has overcome
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
Check out this new paper on how to do planning for discrete diffusion 👏 Really exciting to see more exploration in this direction🔥
New Paper Alert! 🚀 We introduce Path Planning (P2), a sampling approach that optimizes the token unmasking order in Masked Diffusion Models (MDMs). SOTA results across language, math, code, and biological sequences (protein and RNA)—all without training. arxiv.org/pdf/2502.03540 🧵👇
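The common thread in these planning papers is choosing which masked position to commit next at inference time, instead of unmasking in a fixed or random order. Below is a generic illustration that greedily unmasks the position where the denoiser is most confident; it is not the exact P2 (or DDPD) rule, just the flavor of inference-time planning being discussed.

```python
import torch

def confidence_ordered_unmask(denoiser, x, mask_id, steps):
    """Unmask one position per step, picking the one the model is surest about.

    denoiser(x) -> (B, L, V) logits. This greedy 'easiest-first' order is only an
    illustration of inference-time planning, not the rule from any specific paper.
    Mutates x in place and returns it.
    """
    b_idx = torch.arange(x.size(0), device=x.device)
    for _ in range(steps):
        still_masked = (x == mask_id)
        if not still_masked.any():
            break
        probs = denoiser(x).softmax(dim=-1)
        conf, tok = probs.max(dim=-1)                    # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)     # only consider masked slots
        pos = conf.argmax(dim=-1)                        # most confident masked position
        x[b_idx, pos] = tok[b_idx, pos]                  # commit that token
    return x
```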
This quirky topic summarization (edge case?) somehow made my day😂
