Oliver Li
@oliveraochongli
phd in language models at @Cornell_CS; prev @columbianlp
DeepSeek's distilled Qwen models (including 1.5B and 7B) list a 131072-token context window, whereas the base models (Qwen2.5-Math-1.5B/7B) list 4096. Anyone know if this is a mistake in the base model name or distillation w/ extrapolated RoPE?
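One way to sanity-check this is to read the declared config of each checkpoint. A minimal sketch, assuming the `transformers` library, Hub access, and the repo IDs below (which I'm inferring from the model names in question):

```python
from transformers import AutoConfig

for name in [
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # distilled model (assumed repo ID)
    "Qwen/Qwen2.5-Math-1.5B",                     # its stated base model (assumed repo ID)
]:
    cfg = AutoConfig.from_pretrained(name)
    # max_position_embeddings is the declared context window;
    # rope_scaling, if set, would indicate RoPE extrapolation (e.g. YaRN).
    print(name, cfg.max_position_embeddings, getattr(cfg, "rope_scaling", None))
```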

📢Thrilled to share our new paper: Esoteric Language Models (Eso-LMs) > 🔀Fuses autoregressive (AR) and masked diffusion (MDM) paradigms > 🚀First to unlock KV caching for MDMs (65x speedup!) > 🥇Sets new SOTA on generation speed-vs-quality Pareto frontier How? Dive in👇…
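For context on why KV caching matters for speed, here is the standard autoregressive version of the idea, purely as illustration (the paper's contribution is extending this to masked diffusion, which this sketch does not show):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, W_q, W_k, W_v, cache):
    """Plain AR KV caching: past keys/values are stored and reused, so each
    new token only pays for its own projections plus one attention read."""
    cache["K"].append(x_t @ W_k)            # cache this token's key
    cache["V"].append(x_t @ W_v)            # cache this token's value
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    q = x_t @ W_q
    return softmax(q @ K.T / np.sqrt(len(q))) @ V   # attend over cached K/V

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(4):                          # 4 decoding steps share one cache
    out = decode_step(rng.standard_normal(8), W_q, W_k, W_v, cache)
print(out.shape, len(cache["K"]))           # (8,) 4
```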
Tool-using LLMs can learn to reason—without reasoning traces. 🔥 We present Nemotron-Research-Tool-N1, a family of tool-using reasoning LLMs trained entirely via rule-based reinforcement learning—no reasoning supervision, no distillation. 📄 Paper: arxiv.org/pdf/2505.00024 💻…
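For intuition, a toy rule-based reward (not the paper's exact formulation) might grade only the final tool call, never the reasoning trace:

```python
import json

def rule_based_reward(model_output: str, gold_call: dict) -> float:
    """Hypothetical rule-based reward: score only whether the emitted tool
    call parses and matches the reference; no reasoning supervision."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0                                # malformed output: no reward
    correct = (call.get("name") == gold_call["name"]
               and call.get("arguments") == gold_call["arguments"])
    return 1.0 if correct else 0.1                # small credit for valid format

print(rule_based_reward('{"name": "search", "arguments": {"q": "weather"}}',
                        {"name": "search", "arguments": {"q": "weather"}}))  # 1.0
```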
Please check out our Qwen3 Technical Report. 👇🏻 github.com/QwenLM/Qwen3/b…
looking at reasoning models vs their non-reasoning base counterparts, i'm curious why reasoning llms use "I" in their thinking, while base llms use "we." is it learned from the sft data, or does it somehow emerge naturally from RL?
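A quick empirical check would be to count first-person pronouns across sampled traces from both model types. A minimal sketch (the example trace is made up):

```python
import re
from collections import Counter

def pronoun_counts(text: str) -> Counter:
    """Count standalone first-person pronouns in a trace (case-insensitive,
    whole-word matches only)."""
    return Counter(re.findall(r"\b(i|we)\b", text.lower()))

trace = "Hmm, I should factor the expression first. Wait, I think there's a simpler way."
print(pronoun_counts(trace))   # Counter({'i': 2})
```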
We wrapped up New England NLP 2025 at @YaleEngineering 🎉! 215 registrations from 37 institutions ✅ An amazing lineup of speakers 🎤 A heated panel 💬 86 posters and 5 oral presentations 🗣️ Attaching some sparkling moments captured by @_Chuhan_Li!
Time to revisit our paper: open community-driven evaluation platforms can be corrupted by just a few sources of bad annotations, making their results less trustworthy than we'd like. arxiv.org/pdf/2412.04363
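An illustrative simulation of the effect (not the paper's actual analysis): even a small fraction of adversarial votes visibly shifts the observed head-to-head win rate.

```python
import random

def observed_winrate(true_p=0.70, n_votes=10_000, bad_frac=0.0):
    """Pairwise votes between model A (true win prob true_p) and model B,
    where a fraction bad_frac of annotators always votes for B."""
    wins_a = 0
    for _ in range(n_votes):
        if random.random() < bad_frac:
            continue                      # adversarial vote: always B
        wins_a += random.random() < true_p
    return wins_a / n_votes

random.seed(0)
print(observed_winrate(bad_frac=0.0))    # ≈ 0.70, the true win rate
print(observed_winrate(bad_frac=0.10))   # ≈ 0.63, skewed by 10% bad annotators
```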
how did this llama4 score so high on lmsys?? i'm still buckling up to understand qkv through family reunions and weighted values for loving cats...
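Setting the analogies aside, the QKV mechanics fit in a few lines. A minimal NumPy sketch of single-head scaled dot-product attention, illustrative rather than any particular model's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query is compared against all keys; the softmax weights decide
    how much of each value gets mixed into that query's output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                  # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))  # 4 tokens, dim 8
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)
```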
Introducing nanoAhaMoment: Karpathy-style, single file RL for LLM library (<700 lines) - super hackable - no TRL / Verl, no abstraction💆♂️ - Single GPU, full param tuning, 3B LLM - Efficient (R1-zero countdown < 10h) comes with a from-scratch, fully spelled out YT video [1/n]
DeepSeek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face, showing that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and could achieve…
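The general shape of inference-time scaling for a generative reward model (not necessarily SPCT itself) is to sample several judgments and aggregate them. A hedged sketch, where `noisy_judge` is a toy stand-in:

```python
import random
from collections import Counter

def scaled_reward(judge, prompt, response, k=8):
    """Sample k independent judgments from a generative reward model and
    aggregate by majority vote; `judge` is a hypothetical callable."""
    votes = Counter(judge(prompt, response) for _ in range(k))
    return votes.most_common(1)[0][0]

def noisy_judge(prompt, response):
    # Toy stand-in for a generative RM: correct ~70% of the time.
    return 1 if random.random() < 0.7 else 0

random.seed(0)
print(scaled_reward(noisy_judge, "q", "a", k=1))   # one sample: noisy
print(scaled_reward(noisy_judge, "q", "a", k=32))  # majority over 32 samples: more reliable
```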
Struggling to generate dialogues from limited norm samples? Check out our new paper NormDial (#emnlp2023) at arxiv.org/abs/2310.14563…. Thanks @mallika_2011 @skychwang @rkdsaakyan @SmaraMuresanNLP