Taiwei Shi
@taiwei_shi
AI Researcher & Ph.D. student @nlp_usc. Intern @MSFTResearch. Formerly @GeorgiaTech @USC_ISI. NLP & Computational Social Science.
Want to 𝐜𝐮𝐭 𝐑𝐅𝐓 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐭𝐢𝐦𝐞 𝐛𝐲 𝐮𝐩 𝐭𝐨 𝟐× and boost performance? 👉 Meet 𝗔𝗱𝗮𝗥𝗙𝗧, a lightweight, plug-and-play curriculum learning method you can drop into any mainstream RFT algorithm (PPO, GRPO, REINFORCE). Less compute. Better results. 🧵 1/n
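Roughly, the adaptive-curriculum idea in code, as I read it from the thread: keep a target difficulty, train on problems near it, and nudge the target up or down based on recent reward. This is a minimal sketch, not the paper's implementation; the function names, constants, and the reward stub (standing in for a real PPO/GRPO/REINFORCE step) are all illustrative.

```python
import random

def sample_batch(problems, target_difficulty, batch_size=32):
    # Train on the problems whose difficulty is closest to the current target.
    ranked = sorted(problems, key=lambda p: abs(p["difficulty"] - target_difficulty))
    return ranked[:batch_size]

def update_target(target_difficulty, avg_reward, lr=0.5, target_reward=0.5):
    # Scoring above the target success rate -> make the curriculum harder;
    # scoring below it -> make the curriculum easier.
    return target_difficulty + lr * (avg_reward - target_reward)

# Toy driver. The avg_reward stub stands in for one RFT update that
# returns the batch's mean reward.
problems = [{"id": i, "difficulty": random.random()} for i in range(1000)]
target = 0.2  # start easy
for step in range(50):
    batch = sample_batch(problems, target)
    avg_reward = random.uniform(0.3, 0.7)  # stub reward signal
    target = min(max(update_target(target, avg_reward), 0.0), 1.0)
```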

Landed in Vienna for #ACL2025! We are hiring FTEs/Postdocs/Interns at the Office of Applied Research to push the frontier of continuous model improvement for productivity, through RL*, inference-time scaling, self-reflection, memory, etc. Available to chat this week w/ @mengtingwan.
CoT transformed text reasoning. What about multimodal? 🤔 Check out our new dataset of interleaved text and image reasoning traces. We also show interesting visual CoT examples generated inherently by the model finetuned on our dataset!
🚨 Announcing Zebra-CoT, a large-scale dataset of high-quality interleaved image-text reasoning traces. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n
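I haven't seen the dataset schema, but an interleaved trace presumably looks something like the record below; the field names and contents are hypothetical, purely to illustrate what "interleaved image-text reasoning" means.

```python
# Hypothetical record shape for an interleaved image-text reasoning trace.
example = {
    "problem": "Which path through the maze reaches the exit?",
    "trace": [
        {"type": "text",  "content": "Let me sketch the maze and mark dead ends."},
        {"type": "image", "content": "step1_sketch.png"},  # visual aid drawn mid-reasoning
        {"type": "text",  "content": "The left corridor dead-ends, so take the right one."},
        {"type": "image", "content": "step2_path.png"},
        {"type": "text",  "content": "Answer: the right-hand path."},
    ],
}
```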
OpenAI over the years:
2022: Publishes papers in top-tier conferences
2023: Releases technical reports on arXiv
2024: Posts random blogs on its website
2025: "TRUST ME BRO!"
We achieved gold medal-level performance 🥇 on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM! Our model solved world-class math problems at the level of top human contestants. A major milestone for AI and mathematics.
Are you a researcher trying to build a small GPU cluster? Did you already build one, and it sucks? I manage USC NLP's GPU cluster and I'm happy to offer my expertise. I hope I can save you some headaches and make some friends. Please reach out!
Our paper on 𝐒𝐭𝐨𝐜𝐡𝐚𝐬𝐭𝐢𝐜 𝐄𝐫𝐫𝐨𝐫 𝐀𝐬𝐜𝐞𝐧𝐭 (𝐒𝐄𝐀) has been accepted to #COLM2025! 🎉 We introduce a scalable framework for uncovering LLM knowledge gaps with remarkable efficiency. Read more 👇 📄 Paper: arxiv.org/abs/2503.23361 💻 Code: github.com/limenlp/SEA
Want to know what your LLM doesn't know? This is how 👇 Preprint: arxiv.org/abs/2503.23361 Code: github.com/uscnlp-lime/SEA
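I haven't read the paper's exact procedure, but the name suggests a stochastic hill-climb on model error, roughly along these lines. My own sketch: model_fails and the topic-based neighborhood are stand-ins for however SEA actually scores answers and expands candidates.

```python
import random

def stochastic_error_ascent(question_pool, model_fails, rounds=10, k=20):
    """Repeatedly probe the model, then re-sample near previous failures
    so each round climbs toward denser regions of error."""
    frontier = random.sample(question_pool, k)
    gaps = []
    for _ in range(rounds):
        failures = [q for q in frontier if model_fails(q)]
        gaps.extend(failures)
        failed_topics = {q["topic"] for q in failures}
        neighbors = [q for q in question_pool if q["topic"] in failed_topics]
        pool = neighbors if neighbors else question_pool  # fall back to exploration
        frontier = random.sample(pool, min(k, len(pool)))
    return gaps
```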
Happy to have contributed to this research, which brings us one step closer to replacing me as a researcher.
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Glad to see multiple efforts highlighting the challenge of LLMs hallucinating on unanswerable math problems and the importance of abstention. Just a quick correction: we're from USC.
Our results also align with concurrent work from UCLA @linxins2 @taiwei_shi @jieyuzhao11 which also observed reasoning LLMs hallucinate on unanswerable math problems! x.com/linxins2/statu… More evidence to argue that hallucination and failure to abstain is a big challenge in…
🧵 Recent studies show LLMs can self-improve their responses when given external feedback. But how effectively can they incorporate it? We tested this systematically and found they can't fully integrate feedback, even when the feedback is high-quality and backed by ground truth.
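The protocol presumably looks something like this single-round probe. This is my sketch of the shape of such a test, not the paper's code; model and make_feedback are placeholder callables.

```python
def feedback_probe(model, question, gold, make_feedback):
    # First attempt, then feedback grounded in the gold answer, then revision.
    first = model(question)
    if first == gold:
        return "correct_first_try"
    feedback = make_feedback(first, gold)  # e.g. points at the wrong step
    revised = model(f"{question}\nYour previous answer: {first}\n"
                    f"Feedback: {feedback}\nPlease revise your answer.")
    return "integrated" if revised == gold else "failed_to_integrate"
```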
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
Teaching AI to Say "I Don't Know": A New Dataset Mitigates Hallucinations from Reinforcement Finetuning. Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by…
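A minimal sketch of how such data could plug into RFT, assuming the training reward simply credits abstention on the unanswerable split. The exact refusal phrasing, reward values, and example problem are my guesses, not the paper's.

```python
IDK = "I don't know"

def sum_reward(problem, answer):
    # Answerable: reward correctness. Unanswerable: reward abstention only.
    if problem["answerable"]:
        return 1.0 if answer == problem["gold"] else 0.0
    return 1.0 if answer.strip() == IDK else 0.0

# An implicitly unanswerable variant: the train's speed was deleted from the
# premise, so any numeric answer is a hallucination and abstention is gold.
problem = {"question": "A train leaves at 3 pm. How far has it traveled by 5 pm?",
           "answerable": False, "gold": IDK}
assert sum_reward(problem, IDK) == 1.0
assert sum_reward(problem, "120 miles") == 0.0
```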
Can your LLM truly understand and adapt to 𝑛𝑖𝑐ℎ𝑒 𝑐𝑜𝑚𝑚𝑢𝑛𝑖𝑡𝑦 𝑛𝑜𝑟𝑚𝑠? Introducing 𝐒𝐭𝐞𝐞𝐫-𝐁𝐞𝐧𝐜𝐡 🧭, a large-scale benchmark to test 𝐬𝐭𝐞𝐞𝐫𝐚𝐛𝐢𝐥𝐢𝐭𝐲 across 30 highly contrasting online communities.
🤔 How well do LLMs adapt to different norms? 🧵 We introduce STEER-BENCH, a benchmark for assessing steerability in LLMs. 📊 Human: 81% | Top LLM: ~65% 🚨 Norm alignment ≠ solved. 📄 Paper: arxiv.org/abs/2505.20645 @ZihaoHe95 @taiwei_shi @KristinaLerman
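Mechanically, steerability scoring is presumably close to the loop below: steer the model toward a community, ask a multiple-choice question about its norms, and compare against per-community labels. The prompt wording and field names are illustrative, not the benchmark's exact template; ask_model stands in for an LLM call.

```python
def steer_prompt(community, question, options):
    # Steer the model toward a community's perspective, then ask a
    # multiple-choice question about that community's norms.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (f"Answer as a typical member of the online community "
            f"'{community}'.\n\n{question}\n{opts}\nReply with one letter.")

def steerability_score(examples, ask_model):
    # Accuracy of community-steered answers against per-community labels.
    hits = sum(
        ask_model(steer_prompt(e["community"], e["question"], e["options"])) == e["label"]
        for e in examples
    )
    return hits / len(examples)
```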
Is there anything that Qwen cannot do at this point?

Excited to share that I'll be interning at the @Microsoft Office of Applied Research this summer, working on reinforcement finetuning with the awesome @soshsihao and @ylongqi. Seattle friends, let's catch up and chat about anything from alignment to inference-time scaling!

Now accepted at #ACL2025! Thrilled to see our paper also referenced in @lilianweng's latest blog post on reasoning in LLMs! Check it out: lilianweng.github.io/posts/2025-05-…
Process supervision for reasoning is 🔥! While previous approaches often relied on human annotation and struggled to generalize across different reasoning tasks, we're now asking: can we improve this? Introducing 𝐑𝐀𝐓𝐈𝐎𝐍𝐀𝐋𝐘𝐒𝐓: a new model pre-trained on implicit…
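From the description, inference could work roughly like this: at each reasoning step, the rationale model scores candidate next steps by how consistent they are with the implicit rationale it predicts, and the best candidate is kept. This is my sketch of that loop under those assumptions, not the paper's implementation; reasoner and rationale_score are placeholder callables.

```python
def guided_reasoning(reasoner, rationale_score, question, n_candidates=4, max_steps=8):
    # Sample candidate next steps, keep the one the rationale model rates
    # most consistent with its predicted (implicit) rationale.
    trace = [question]
    for _ in range(max_steps):
        context = "\n".join(trace)
        candidates = [reasoner(context) for _ in range(n_candidates)]
        best = max(candidates, key=lambda step: rationale_score(context, step))
        trace.append(best)
        if best.startswith("Answer:"):
            break
    return trace
```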