Oliver Li
@oliveraochongli
phd in language models at @Cornell_CS; prev @columbianlp
DeepSeek's distilled Qwen models (including 1.5B and 7B) list a 131072-token context window, whereas the base models (Qwen2.5-Math-1.5B/7B) list 4096. Anyone know if this is a mistake in the base model name or distillation w/ extrapolated RoPE?
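One way to sanity-check this is to read the declared config of each checkpoint. A minimal sketch, assuming the `transformers` library, Hub access, and the repo IDs below (which I'm inferring from the model names in question):

```python
from transformers import AutoConfig

for name in [
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # distilled model (assumed repo ID)
    "Qwen/Qwen2.5-Math-1.5B",                     # its stated base model (assumed repo ID)
]:
    cfg = AutoConfig.from_pretrained(name)
    # max_position_embeddings is the declared context window;
    # rope_scaling, if set, would indicate RoPE extrapolation (e.g. YaRN).
    print(name, cfg.max_position_embeddings, getattr(cfg, "rope_scaling", None))
```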

📢Thrilled to share our new paper: Esoteric Language Models (Eso-LMs) > 🔀Fuses autoregressive (AR) and masked diffusion (MDM) paradigms > 🚀First to unlock KV caching for MDMs (65x speedup!) > 🥇Sets new SOTA on generation speed-vs-quality Pareto frontier How? Dive in👇…
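For context on why KV caching matters for speed, here is the standard autoregressive version of the idea, purely as illustration (the paper's contribution is extending this to masked diffusion, which this sketch does not show):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, W_q, W_k, W_v, cache):
    """Plain AR KV caching: past keys/values are stored and reused, so each
    new token only pays for its own projections plus one attention read."""
    cache["K"].append(x_t @ W_k)            # cache this token's key
    cache["V"].append(x_t @ W_v)            # cache this token's value
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    q = x_t @ W_q
    return softmax(q @ K.T / np.sqrt(len(q))) @ V   # attend over cached K/V

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(4):                          # 4 decoding steps share one cache
    out = decode_step(rng.standard_normal(8), W_q, W_k, W_v, cache)
print(out.shape, len(cache["K"]))           # (8,) 4
```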
Tool-using LLMs can learn to reason—without reasoning traces. 🔥 We present Nemotron-Research-Tool-N1, a family of tool-using reasoning LLMs trained entirely via rule-based reinforcement learning—no reasoning supervision, no distillation. 📄 Paper: arxiv.org/pdf/2505.00024 💻…
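For intuition, a toy rule-based reward (not the paper's exact formulation) might grade only the final tool call, never the reasoning trace:

```python
import json

def rule_based_reward(model_output: str, gold_call: dict) -> float:
    """Hypothetical rule-based reward: score only whether the emitted tool
    call parses and matches the reference; no reasoning supervision."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0                                # malformed output: no reward
    correct = (call.get("name") == gold_call["name"]
               and call.get("arguments") == gold_call["arguments"])
    return 1.0 if correct else 0.1                # small credit for valid format

print(rule_based_reward('{"name": "search", "arguments": {"q": "weather"}}',
                        {"name": "search", "arguments": {"q": "weather"}}))  # 1.0
```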
Please check out our Qwen3 Technical Report. 👇🏻 github.com/QwenLM/Qwen3/b…
looking at reasoning models vs their non-reasoning base counterparts, i'm curious why reasoning llms use "I" in their thinking, while base llms use "we." is it learned from the sft data, or does it somehow emerge naturally from RL?
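A quick empirical check would be to count first-person pronouns across sampled traces from both model types. A minimal sketch (the example trace is made up):

```python
import re
from collections import Counter

def pronoun_counts(text: str) -> Counter:
    """Count standalone first-person pronouns in a trace (case-insensitive,
    whole-word matches only)."""
    return Counter(re.findall(r"\b(i|we)\b", text.lower()))

trace = "Hmm, I should factor the expression first. Wait, I think there's a simpler way."
print(pronoun_counts(trace))   # Counter({'i': 2})
```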
We wrapped up New England NLP 2025 at @YaleEngineering 🎉! 215 registrations from 37 institutions ✅ An amazing lineup of speakers 🎤 A heated panel 💬 86 posters and 5 oral presentations 🗣️ Attaching some sparkling moments captured by @_Chuhan_Li!
Time to revisit our paper: open community-driven evaluation platforms can be corrupted by just a few sources of bad annotations, making their results less trustworthy than we'd like. arxiv.org/pdf/2412.04363
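An illustrative simulation of the effect (not the paper's actual analysis): even a small fraction of adversarial votes visibly shifts the observed head-to-head win rate.

```python
import random

def observed_winrate(true_p=0.70, n_votes=10_000, bad_frac=0.0):
    """Pairwise votes between model A (true win prob true_p) and model B,
    where a fraction bad_frac of annotators always votes for B."""
    wins_a = 0
    for _ in range(n_votes):
        if random.random() < bad_frac:
            continue                      # adversarial vote: always B
        wins_a += random.random() < true_p
    return wins_a / n_votes

random.seed(0)
print(observed_winrate(bad_frac=0.0))    # ≈ 0.70, the true win rate
print(observed_winrate(bad_frac=0.10))   # ≈ 0.63, skewed by 10% bad annotators
```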
how did this llama4 score so high on lmsys?? i'm still buckling up to understand qkv through family reunions and weighted values for loving cats...
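Setting the analogies aside, the QKV mechanics fit in a few lines. A minimal NumPy sketch of single-head scaled dot-product attention, illustrative rather than any particular model's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query is compared against all keys; the softmax weights decide
    how much of each value gets mixed into that query's output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                  # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))  # 4 tokens, dim 8
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)
```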
Introducing nanoAhaMoment: Karpathy-style, single file RL for LLM library (<700 lines) - super hackable - no TRL / Verl, no abstraction💆♂️ - Single GPU, full param tuning, 3B LLM - Efficient (R1-zero countdown < 10h) comes with a from-scratch, fully spelled out YT video [1/n]
DeepSeek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face, showing that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and could achieve…
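The general shape of inference-time scaling for a generative reward model (not necessarily SPCT itself) is to sample several judgments and aggregate them. A hedged sketch, where `noisy_judge` is a toy stand-in:

```python
import random
from collections import Counter

def scaled_reward(judge, prompt, response, k=8):
    """Sample k independent judgments from a generative reward model and
    aggregate by majority vote; `judge` is a hypothetical callable."""
    votes = Counter(judge(prompt, response) for _ in range(k))
    return votes.most_common(1)[0][0]

def noisy_judge(prompt, response):
    # Toy stand-in for a generative RM: correct ~70% of the time.
    return 1 if random.random() < 0.7 else 0

random.seed(0)
print(scaled_reward(noisy_judge, "q", "a", k=1))   # one sample: noisy
print(scaled_reward(noisy_judge, "q", "a", k=32))  # majority over 32 samples: more reliable
```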
Struggling to generate dialogues from limited norm samples? Check out our new paper NormDial (#emnlp2023) at arxiv.org/abs/2310.14563…. Thanks @mallika_2011 @skychwang @rkdsaakyan @SmaraMuresanNLP