Fangcong Yin
@fangcong_y10593
CS PhD Student @UTAustin studying NLP. Prev: @CornellCIS
Solving complex problems with CoT requires combining different skills. We can do this by: 🧩Modifying the CoT data format to be “composable” with other skills 🔥Training models on each skill 📌Combining those models This leads to better zero-shot reasoning on tasks involving skill composition!
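The tweet doesn't say how the skill-specialized models are combined, but one common way to merge separately trained checkpoints is parameter averaging (as in model soups / task arithmetic). A minimal sketch under that assumption — the toy parameter dicts and the uniform-averaging rule here are illustrative, not the paper's actual method:

```python
import numpy as np

# Toy "checkpoints": parameter dicts for two skill-specialized models
# (hypothetical; the actual combination rule may differ).
skill_a = {"w": np.array([1.0, 0.0]), "b": np.array([0.5])}
skill_b = {"w": np.array([0.0, 1.0]), "b": np.array([-0.5])}

def merge(models, weights=None):
    """Weighted parameter averaging of same-shape checkpoints
    (uniform weights by default)."""
    weights = weights or [1 / len(models)] * len(models)
    return {
        name: sum(w * m[name] for w, m in zip(weights, models))
        for name in models[0]
    }

combined = merge([skill_a, skill_b])
print(combined["w"])  # [0.5 0.5]
```

The same averaging applies per-tensor to real model state dicts, as long as the checkpoints share an architecture.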


What if you could understand and control an LLM by studying its *smaller* sibling? Our new paper proposes the Linear Representation Transferability Hypothesis: internal representations of different-sized models can be translated via a simple linear (affine) map.
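The hypothesis above — that representations of different-sized models are related by a simple affine map — can be sketched with a least-squares fit. Everything here is a toy stand-in (random matrices in place of real hidden states; dimensions 16 and 32 are made up), just to show what "fit a linear (affine) translation between representation spaces" means operationally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden states of a small and a large model on the same
# inputs (hypothetical sizes: d_small=16, d_large=32, 500 examples).
n, d_small, d_large = 500, 16, 32
H_small = rng.normal(size=(n, d_small))

# Simulate the hypothesis: the large model's representations are a
# (noisy) affine image of the small model's.
A_true = rng.normal(size=(d_small, d_large))
b_true = rng.normal(size=(d_large,))
H_large = H_small @ A_true + b_true + 0.01 * rng.normal(size=(n, d_large))

# Fit the affine map by ordinary least squares (bias via a ones column).
X = np.hstack([H_small, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, H_large, rcond=None)
A_hat, b_hat = coef[:-1], coef[-1]

# Translated small-model states should approximate the large model's.
pred = H_small @ A_hat + b_hat
r2 = 1 - ((H_large - pred) ** 2).sum() / ((H_large - H_large.mean(0)) ** 2).sum()
print(round(r2, 3))
```

With real models, H_small and H_large would be hidden states collected at chosen layers over a shared corpus; a high held-out R² is the kind of evidence the hypothesis predicts.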
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason — they just aren’t perfect at long-form generation yet. We evaluate reasoning models on the LongProc benchmark (which requires generating 8K-token CoTs; see thread). Reasoning…
🤔Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing 8K token chain-of-thought for a planning problem? 🔔Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize…
🧵 Recent studies show LLMs can self-improve their responses when given external feedback. But how effectively can they incorporate it? We tested this systematically—and found they can't fully integrate feedback, even when the feedback is high-quality and backed by ground-truth.
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
Check out our new work on query-focused retrieval heads of LLMs! It is cool to see how interpretability insights can be used to improve zero-shot reasoning and re-ranking over long context.
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval? 📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval. Main contributions: 🔍 Better head detection: we find a…
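The core intuition — score a passage by how much attention mass the query tokens direct at it through a retrieval head — can be sketched in a fully deterministic toy. The one-hot keys, document spans, and scoring rule below are illustrative assumptions, not the paper's actual detection or ranking procedure:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy context of 13 tokens: [doc1 (5) | doc2 (5) | query (3)], with
# one-hot key vectors so the outcome is deterministic (hypothetical).
n, d = 13, 13
doc_spans = {"doc1": slice(0, 5), "doc2": slice(5, 10)}
q_span = slice(10, 13)

K = np.eye(n)               # per-token key vectors at this head
Q = np.zeros((n, d))
Q[q_span] = 10 * K[5:8]     # make the head point query tokens at doc2

# Attention rows for the query tokens only.
attn = softmax(Q[q_span] @ K.T / np.sqrt(d))

# Retrieval score: total attention mass each document receives
# from the query tokens at this head.
scores = {name: float(attn[:, span].sum()) for name, span in doc_spans.items()}
best = max(scores, key=scores.get)
print(best)  # doc2
```

With a real model, K and Q would come from a detected retrieval head's key/query projections over the full context, and the per-document attention mass would serve as a zero-shot re-ranking signal.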
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to verbalize via text CoTs 📉Humans reach 93%, while Gemini-2.5-Pro gets 63% and Qwen2.5-72B only 38%
Check out my work at @bespokelabsai We release Bespoke-MiniChart-7B, a new SOTA in chart understanding for its size Chart understanding is fun and challenging, requiring reasoning skills beyond math reasoning It's a great starting point for open chart model development!
Announcing Bespoke-MiniChart-7B, a new SOTA in chart understanding for models of comparable size on seven benchmarks, on par with Gemini-1.5-Pro and Claude-3.5! 🚀 Beyond its real-world applications, chart understanding is a challenging problem for VLMs, since it requires…
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
Your long-context model might be good at understanding long inputs, but can it generate long outputs? Check out our new benchmark on Long Procedural Generation!
Interesting perspective, thanks for sharing! As one of the authors of the “CoT mainly helps on math/logic paper”, I agree with a lot of this, especially the connection to generator/validator gaps. One of our aims going into this project was to find datasets beyond math/logic…
An underrated but occasionally make-or-break skill in AI research (that didn’t really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day when the bottleneck in AI was learning, many methods were…
I'm shocked to see racism happening in academia again, at the best AI conference @NeurIPSConf. Targeting specific ethnic groups to describe misconduct is inappropriate and unacceptable. @NeurIPSConf must take a stand. We call on Rosalind Picard @MIT @medialab to retract and…