Yuqing Yang
@yyqcode
First-year PhD student @CSatUSC @nlp_usc.
🧐Do LLMs admit their mistakes when they should know better? In our new paper, we define this behavior as retraction: the model indicates that its generated answer was wrong. LLMs can retract—but they rarely do.🤯 arxiv.org/abs/2505.16170 👇🧵

Billion-parameter LLMs still struggle with simple arithmetic? 📞 FoNE (Fourier Number Embedding) tackles this problem. By mapping numbers directly into Fourier space, it bypasses tokenization and significantly improves numerical accuracy and efficiency.
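For intuition, here is a minimal sketch of the Fourier-feature idea rather than the paper's implementation: each number becomes cos/sin features at a few periods, so the model sees one embedding per number instead of a sequence of digit tokens. The function name, period choices, and dimensions below are illustrative assumptions.

```python
import torch

def fourier_number_embedding(x: torch.Tensor, periods=(10., 100., 1000., 10000.)) -> torch.Tensor:
    # Hypothetical sketch: one (cos, sin) pair per period, so each pair cycles
    # once per digit position (units, tens, hundreds, ...).
    # Returns a (batch, 2 * num_periods) tensor.
    phase = 2 * torch.pi * x.float().unsqueeze(-1) / torch.tensor(periods)
    return torch.cat([torch.cos(phase), torch.sin(phase)], dim=-1)

# A number like 1234 becomes a single 8-dim vector instead of several digit tokens.
print(fourier_number_embedding(torch.tensor([3.14, 42.0, 1234.0])).shape)  # torch.Size([3, 8])
```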
🚨 4B open-recipe model beats Claude-4-Opus 🔓 100% open data, recipe, model weights and code. Introducing Polaris✨--a post-training recipe for scaling RL on advanced reasoning models. 🥳 Check out how we boost open-recipe reasoning models to incredible performance levels…
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason; they just aren’t perfect at long-form generation yet. We evaluate reasoning models on the LongProc benchmark (which requires generating 8K-token CoTs, see thread). Reasoning…
🤔Most LLMs now have >=128K context windows, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
🧵 Recent studies show LLMs can self-improve their responses when given external feedback. But how effectively can they incorporate it? We tested this systematically—and found they can't fully integrate feedback, even when the feedback is high-quality and backed by ground-truth.
🚨 We discovered a surprising side effect of Reinforcement Finetuning (RFT): it makes LLMs more confidently wrong on unanswerable questions. We call this the hallucination tax: a drop in refusal behavior that leads to overconfident hallucinations. 🧵 1/n
Textual steering vectors can improve visual understanding in multimodal LLMs! You can extract steering vectors via any interpretability toolkit you like -- SAEs, MeanShift, Probes -- and apply them to image or text tokens (or both) of Multimodal LLMs. And They Steer!
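As a concrete illustration (my own minimal sketch, not the paper's code), a MeanShift-style steering vector is just the difference of mean hidden states between prompts with and without the target attribute, added back to chosen token positions at inference. The function names, mask convention, and scale alpha are assumptions.

```python
import torch

def mean_shift_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # pos_acts / neg_acts: (num_prompts, hidden_dim) activations collected at one layer.
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def steer(hidden_states: torch.Tensor, vec: torch.Tensor,
          token_mask: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # hidden_states: (seq_len, hidden_dim); token_mask: (seq_len,) bool selecting
    # image tokens, text tokens, or both. Adds the scaled vector only at those positions.
    return hidden_states + alpha * token_mask.float().unsqueeze(-1) * vec
```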
Want to know what your LLM doesn’t know? This is how 👇 Preprint: arxiv.org/abs/2503.23361 Code: github.com/uscnlp-lime/SEA
Running your model on multiple GPUs but finding the speed unsatisfactory? We introduce Ladder-residual, a parallelism-aware architecture modification that makes a 70B Llama with tensor parallelism ~30% faster! Work done at @togethercompute. Co-1st author with @MayankMish98…
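Roughly, the speedup comes from overlapping tensor-parallel communication with computation. The toy sketch below is my own illustration of that general overlap idea under assumed async-collective helpers (all_reduce_async, wait), not the paper's actual architecture.

```python
def overlapped_forward(x, blocks, all_reduce_async, wait):
    # Toy illustration: instead of blocking on each block's tensor-parallel
    # all-reduce before the next block runs, kick off the collective and fold
    # its result into the residual stream one step later, so communication
    # for block i overlaps with computation of block i+1.
    pending = None
    for block in blocks:
        if pending is not None:
            x = x + wait(pending)             # add the previous block's (now reduced) output
        pending = all_reduce_async(block(x))  # start this block's all-reduce without waiting
    return x + wait(pending)
```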
Come join the #NeurIPS2024 poster session and discuss whether language models can learn to skip steps in reasoning! 🗓Dec 12, Thursday, 11:00 am - 2:00 pm 📍East Exhibit Hall A-C #2900 Feel free to stop by and say hi! I am actively seeking Summer 2025 internship opportunities!
🤔Can LMs learn to skip steps to improve reasoning efficiency while maintaining accuracy? ✅The answer is Yes! In our #NeurIPS 2024 work, we show this behavior boosts efficiency, maintains accuracy, and even enhances generalization in OOD scenarios! 🚀arxiv.org/pdf/2411.01855 🧵⬇️
I'll present a poster for Lifelong ICL and Task Haystack at #NeurIPS2024! ⏰ Wednesday 11am-2pm 📍 East Exhibit Hall A-C #2802 📜 arxiv.org/abs/2407.16695 My co-first author @xiaoyue02_xu is applying to PhD programs and I am looking for jobs in industry! Happy to connect at NeurIPS!
Introducing 𝗟𝗶𝗳𝗲𝗹𝗼𝗻𝗴 𝗜𝗖𝗟 and 𝗧𝗮𝘀𝗸 𝗛𝗮𝘆𝘀𝘁𝗮𝗰𝗸, a new approach for evaluating long-context LMs, featuring ever-changing task streams that controllably fill the context window, and NIAH-style visualization for easy diagnosis. 📜 arxiv.org/abs/2407.16695 🧵
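To make the setup concrete, here's a tiny sketch (the prompt format and function name are my own, hypothetical) of how a Task Haystack-style prompt could be assembled: few-shot blocks for a stream of tasks controllably fill the context, and the model is then queried on a task it saw earlier in the stream.

```python
def build_lifelong_icl_prompt(task_stream, test_task, test_input):
    # task_stream: list of (task_name, [(input, output), ...]) few-shot blocks
    # in the order they "arrive"; the test task is one of them, revisited later.
    parts = []
    for task_name, examples in task_stream:
        parts.append(f"Task: {task_name}")
        parts.extend(f"Input: {x}\nOutput: {y}" for x, y in examples)
    parts.append(f"Task: {test_task}")
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)
```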