Ben Cohen-Wang
@bcohenwang
Machine learning PhD student at MIT advised by Aleksander Madry
Popular reasoning benchmarks just reward correct answers (they don't penalize guessing). This incentivizes models that guess when they're not sure, which (beyond hurting usability) seems like it would encourage hallucinations more broadly. Is this why o3 etc. hallucinate a lot?
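A rough sketch of the incentive, not any benchmark's actual scoring rule: assume a correct answer scores +1, abstaining scores 0, and a wrong answer costs a hypothetical `wrong_penalty`. With no penalty, guessing has positive expected score at any nonzero confidence, so abstaining never pays off.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering instead of abstaining.

    Assumed (hypothetical) scoring: correct = +1, wrong = -wrong_penalty,
    abstain = 0. p_correct is the model's chance of being right.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only if doing so beats the abstain score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    for penalty in (0.0, 1.0):
        for p in (0.1, 0.6):
            print(f"penalty={penalty} p={p} guess={should_guess(p, penalty)}")
```

Under these assumptions, answering beats abstaining whenever p_correct > wrong_penalty / (1 + wrong_penalty), so a penalty of 1 only rewards answers the model is more than 50% sure of.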
Increasingly, LLMs cite sources for claims they make, but are the sources they cite actually what they are using? In work led by @YungSungChuang, we design a reward to quantify this, and use this reward to (automatically) improve citation quality! 🧵
(1/5)🚨LLMs can now self-improve to generate better citations✅ 📝We design automatic rewards to assess citation quality 🤖Enable BoN/SimPO w/o external supervision 📈Perform close to “Claude Citations” API w/ only 8B model 📄arxiv.org/abs/2502.09604 🧑💻github.com/voidism/SelfCi…
We introduce ContextCite, a tool that can help us understand when and how an LLM uses in-context information! w/ @harshays_, @kris_georgiev1, @aleks_madry Check out our demo: huggingface.co/spaces/context… Thread ⤵️
How is an LLM actually using the info given to it in its context? Is it misinterpreting anything or making things up? Introducing ContextCite: a simple method for attributing LLM responses back to the context: gradientscience.org/contextcite w/ @bcohenwang, @harshays_, @kris_georgiev1
Models often fail under distribution shifts—can pre-training on a large and diverse dataset and then fine-tuning on a task-specific dataset help? W/ @bcohenwang, @josh_vendrow we show that this depends on the specific failure mode. In particular, pre-training can help with…
Will your model identify a polar bear on the moon? How would you know? Dataset Interfaces let you generate images from your dataset under whatever distribution shift you desire! arxiv.org/abs/2302.07865 gradientscience.org/dataset-interf… W/ @josh_vendrow @saachi_jain_ @logan_engstrom
Our paper on immunizing images to diffusion model-powered malicious manipulation is out arxiv.org/abs/2302.06588! This approach, combined with policy incentives, aims to raise the cost of such unauthorized image editing. w/ @hadisalmanX @Alaa_Khaddaj @gpoleclerc @andrew_ilyas