Chunyuan Deng
@ChunyuanDeng
Ph.D. Student @RiceCompSci.
🚀 Introducing Prefix-RFT to blend SFT and RFT! SFT can learn more complex problems by mimicking demonstrations, but it often generalizes poorly. RFT performs better overall but is limited by the initial policy. Our method, Prefix-RFT, gets the best of both worlds!
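A minimal sketch of how such a blend could look, assuming a hypothetical `policy.generate(prompt, prefix)` API and a verifiable `reward_fn`; the actual Prefix-RFT recipe may differ:

```python
import random

def prefix_rft_rollout(policy, prompt, demo_tokens, reward_fn, max_frac=1.0):
    """Sketch of one prefix-guided rollout (my simplification, not the paper's
    exact recipe): keep a random-length prefix of the SFT demonstration, let the
    policy complete the rest on-policy, and score the result with the RFT reward."""
    k = int(len(demo_tokens) * random.uniform(0.0, max_frac))
    prefix = demo_tokens[:k]                        # demonstration prefix (SFT signal)
    continuation = policy.generate(prompt, prefix)  # hypothetical API: on-policy continuation
    completion = prefix + continuation
    return completion, reward_fn(prompt, completion)  # verifiable reward (RFT signal)
```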
Do you ever wish all LLMs used the same tokenizer?🧑🤝🧑 We present an *efficient, lossless* method to convert any LM into a byte-level model at inference time. This fixes weird tokenization artifacts at the prompt boundary and enables ensembles of LMs with mismatched tokenizers! 🧵
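As a toy illustration of the general marginalization idea (not the paper's exact lossless procedure), the next-byte probability at a token boundary can be read off a token-level LM by summing the mass of all tokens whose byte encoding starts with that byte:

```python
from collections import defaultdict

def next_byte_distribution(token_probs, vocab_bytes):
    """Toy sketch: token_probs maps token id -> probability under the LM,
    vocab_bytes maps token id -> that token's UTF-8 bytes. The probability of
    the next byte is the total mass of tokens beginning with that byte."""
    byte_probs = defaultdict(float)
    for tok_id, p in token_probs.items():
        b = vocab_bytes[tok_id]
        if b:                               # skip empty/special tokens
            byte_probs[b[0]] += p
    total = sum(byte_probs.values())
    return {byte: p / total for byte, p in byte_probs.items()}
```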
I’m at #ICML this week! 🍁🍁🍁 @hanjie_chen and I will present our work at the Wed 4:30 pm poster session (July 16th), feel free to stop by if you are also interested in steering & control! 😃
[#ICML2025] If a steering vector can be multiplied by a scalar to control its effect, why should it be a vector instead of a steering "region"? We introduce distribution-wise intervention, a simple method that directly learns the latent intervention region through…
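One way to picture a learned "region" rather than a fixed vector (my toy sketch, not the paper's exact parameterization) is a Gaussian over steering vectors that is sampled at intervention time:

```python
import torch
import torch.nn as nn

class StochasticSteering(nn.Module):
    """Toy sketch of a distribution-wise intervention: instead of one fixed
    steering vector, learn a Gaussian region (mean + log-std) in latent space
    and sample a vector from it each time we intervene."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(hidden_size))
        self.log_sigma = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
        # Reparameterized sample from the learned region, scaled the same way
        # a classic steering vector would be.
        eps = torch.randn_like(self.mu)
        steer = self.mu + eps * self.log_sigma.exp()
        return hidden + strength * steer
```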
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
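A rough sketch of what deduplication-based upsampling could look like (the duplicate counts, cap, and log scaling below are my own assumptions, not the FineWeb2 recipe): documents that had many near-duplicates before dedup get repeated more often in the final mix.

```python
import math
import random

def rehydrate(docs, dup_counts, max_copies=8, seed=0):
    """Toy sketch of rehydration: each deduplicated document is kept once plus
    extra copies that grow with the (logged, capped) number of duplicates it
    had before deduplication."""
    rng = random.Random(seed)
    upsampled = []
    for doc, n_dups in zip(docs, dup_counts):
        copies = min(max_copies, 1 + int(math.log2(1 + n_dups)))
        upsampled.extend([doc] * copies)
    rng.shuffle(upsampled)
    return upsampled
```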
Thanks for sharing our work!!!🙏Code release is in progress😺
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. Apple introduces DiffuCoder, a 7B diffusion LLM trained on 130B tokens of code. The authors also propose a diffusion-native RL training framework, coupled-GRPO. Decoding of dLLMs differs from…
From Bytes to Ideas: avoids predefined vocabs and memory-heavy embedding tables. Instead, it uses Autoregressive U-Nets to embed information directly from raw bytes. This is huge! Enables infinite vocab size and more. More in my notes below:
Excited to share our #ICML2025 paper, led by my student @ChunyuanDeng🥳 Model intervention offers lightweight control over predictive behavior. Can we make it more effective and robust?🤔 ✅Check out our work on learning stochastic interventions in latent representation space✨
Generally, I feel QuietStar-style reasoning may be better placed at a more abstract level. Also would love to see some FLOPs-matched comparisons with NTP if we are truly debating whether this is a *cherry* cake or not
⏰ We introduce Reinforcement Pre-Training (RPT🍒) — reframing next-token prediction as a reasoning task using RLVR ✅ General-purpose reasoning 📑 Scalable RL on web corpus 📈 Stronger pre-training + RLVR results 🚀 Allows allocating more compute to specific tokens
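A sketch of how the verifiable reward could look under my reading of the setup (the `policy.reason_and_predict` API is hypothetical): sample several reasoning traces, reward a trace 1 when its final next-token guess matches the corpus token and 0 otherwise, then feed those rewards to a standard RLVR-style update.

```python
def rpt_rollouts(policy, context_tokens, next_token, n_samples=8):
    """Toy sketch of RPT-style rewards (my simplification, not the paper's exact
    setup): each rollout produces a reasoning trace plus a next-token guess, and
    the reward is 1.0 only if the guess matches the token found in the corpus."""
    rollouts = []
    for _ in range(n_samples):
        trace, guess = policy.reason_and_predict(context_tokens)  # hypothetical API
        reward = 1.0 if guess == next_token else 0.0
        rollouts.append((trace, reward))
    return rollouts
```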
🚀 Can LLMs stop overthinking when detailed reasoning isn't needed? Excited to share our latest work on LLM reasoning: AutoL2S 🧠⚡ 📄 Paper: arxiv.org/abs/2505.22662 🤖 Model: huggingface.co/amandaa/AutoL2… LLMs often overthink—generating unnecessarily long CoTs even for easy…