Philippe Laban
@PhilippeLaban
Research Scientist @MSFTResearch. NLP/HCI Research.
Maybe don't use an LLM for _everything_? Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries
Two issues w/ preference LM benchmarks:
🐡 data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (Claude ranks lower than expected)
Led by @cmalaviya11 (TACL 2025), we study underspecified queries & their detrimental effect on model evals
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
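The core idea above can be sketched in a few lines: judge a pair of responses to an underspecified query both with and without the missing context, and check whether the judge's preference flips. This is an illustrative toy, not the paper's actual pipeline; `toy_judge` and all strings below are made up for demonstration.

```python
# Hedged sketch of "contextualized evaluation": compare judge verdicts on an
# underspecified query before and after the missing context is supplied.
# judge_fn stands in for a real LLM judge; all names here are illustrative.

def judge(query, resp_a, resp_b, context=None, judge_fn=None):
    """Build the judge prompt (optionally with context) and return 'A' or 'B'."""
    prompt = query if context is None else f"{query}\nContext: {context}"
    return judge_fn(prompt, resp_a, resp_b)

# Toy judge: with no context it prefers the longer answer; with context it
# prefers the answer mentioning the first context keyword.
def toy_judge(prompt, a, b):
    if "Context:" in prompt:
        key = prompt.rsplit("Context:", 1)[1].split()[0].lower()
        return "A" if key in a.lower() else "B"
    return "A" if len(a) > len(b) else "B"

q = "Recommend a laptop."
a = "A powerful 16-inch workstation with a discrete GPU and lots of RAM."
b = "A light, cheap ultrabook for travel."
no_ctx = judge(q, a, b, judge_fn=toy_judge)
with_ctx = judge(q, a, b, context="travel budget", judge_fn=toy_judge)
```

Here the verdict flips from A to B once the context is added, which is exactly the mechanism by which added context can reorder a leaderboard.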
🏆 #ICML2025 Best Paper Award: AI Safety Should Prioritize the Future of Work 📄 Paper: arxiv.org/abs/2504.13959 🎉 Congratulations to Sanchaita Hazra @hsanchaita, Bodhisattwa Prasad Majumder @mbodhisattwa, and Tuhin Chakrabarty @TuhinChakr for winning the Outstanding Award —…
New paper from my group at @MSFTResearch! 📄arxiv.org/abs/2507.07935 Promises about how AI will change work are cheap. What does the actual data say? We measured which work activities people use AI for, how successful they are, and which jobs do those tasks. 🧵1/8
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! 📜 arxiv.org/pdf/2504.15219
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent finds 👩🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
Now accepted to @COLM_conf 🤩 Super excited for Montreal 🇨🇦🍁 This also marks my third successful collaboration with my good friend @PhilippeLaban
Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
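One common way to use a reward model at test time, as the post describes, is best-of-n reranking: sample several drafts and keep the one the RM scores highest. A minimal sketch, assuming a generic `score_fn`; the `toy_reward` heuristic below is a placeholder for the trained reward model, not the paper's method.

```python
# Minimal sketch of test-time reward-model use via best-of-n reranking.
# score_fn stands in for a trained writing reward model; names are illustrative.

def best_of_n(candidates, score_fn):
    """Return the candidate draft the reward model scores highest."""
    return max(candidates, key=score_fn)

# Toy reward: penalize length and a few stereotypical "slop" words.
def toy_reward(text):
    slop_words = {"delve", "tapestry", "furthermore"}
    penalty = sum(text.lower().count(w) for w in slop_words)
    return -len(text.split()) - 10 * penalty

drafts = [
    "Furthermore, we delve into a rich tapestry of ideas.",
    "We test three ideas.",
]
best = best_of_n(drafts, toy_reward)  # the terse, slop-free draft wins
```

Swapping `toy_reward` for a learned RM is the only change needed to turn this sketch into a real reranker.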
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
Do people actually like human-like LLMs? In our #ACL2025 paper HumT DumT, we find a kind of uncanny valley effect: users dislike LLM outputs that are *too human-like*. We thus develop methods to reduce human-likeness without sacrificing performance.
🚨 The Business AI Plot Thickens 🚨 CRMArena set the stage for business AI evaluation in realistic environments. Now we're back with CRMArena-Pro - a major expansion that extends to 19 work tasks across diverse business applications (sales, service, and CPQ processes). It covers…
"Tell, Don't Show" was accepted to #ACL2025 Findings! Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box ✨📚 arxiv.org/abs/2505.23166
Human-AI conversations are multi-turn, lengthy and complex; something that existing LLM judges do not explicitly accommodate!👩🤖 Excited to share our new preprint on evaluating and improving preference judgment for such conversations!🙌🏽 📑: arxiv.org/pdf/2505.20451 🧵👇 [1/n]
What are NLP papers really saying about the purpose and use of their models/datasets? 🤔 Who are they for? What problems do they solve? How are they used? We built a framework + tool to: (1) analyze framing trends across papers (2) help authors reflect on their own framing 🧵
We are eagerly looking for ACs for #chi2026 Computational Interaction--one of the most exciting & fastest growing SCs! If you have CHI publication & review experiences on related topics and would like to contribute, please DM or email me!
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to verbalize via text CoTs 📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B
Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.…
Vision Language Models (VLMs) are great at many things, but they often fumble when it comes to simple visual arithmetics like counting or comparing lengths, hindering their understanding of charts 📈 and geometry 📐. Our new paper explores why this happens 🧐 and discovers the…
Do LLMs struggle in long, multi-turn conversations? Yes, they do: performance degrades in multi-turn conversations due to an increase in unreliability. A new paper found a drop of 39% in multi-turn settings, showing LLMs make premature assumptions and struggle to recover from…
New paper! 🧵1/3 LLMs earn high marks with fully-specified, single-turn prompts, yet real users reveal needs over time. We built "sharded simulation," which reveals information progressively. Performance plummets from 90% ➡️ 65%, even in 2-turn chats, on ALL 15 LLMs we tested.
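The sharded-simulation setup described above can be sketched as follows: split a fully specified instruction into information shards and reveal one shard per user turn instead of all at once. This is a hedged toy version; the naive sentence split and `toy_assistant` are illustrative stand-ins, not the paper's actual sharding procedure or models.

```python
# Hedged sketch of "sharded" conversation simulation: a fully specified
# instruction is split into shards that are revealed one turn at a time.

def shard_instruction(full_instruction: str) -> list[str]:
    """Split a fully specified prompt into atomic information shards.
    Naive sentence split for illustration; real sharding is more careful."""
    return [s.strip() for s in full_instruction.split(".") if s.strip()]

def simulate_sharded_conversation(shards, assistant_fn):
    """Reveal one shard per user turn and collect the assistant's replies."""
    history = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        reply = assistant_fn(history)  # any chat-model callable works here
        history.append({"role": "assistant", "content": reply})
    return history

# Toy assistant that just reports how many user turns it has seen so far.
def toy_assistant(history):
    n_user = sum(1 for m in history if m["role"] == "user")
    return f"ack turn {n_user}"

shards = shard_instruction(
    "Write a Python function. It must sort a list. Ties keep input order."
)
convo = simulate_sharded_conversation(shards, toy_assistant)
```

Comparing model outputs on the sharded conversation against the single-turn fully specified prompt is what exposes the multi-turn performance drop.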
Why LLMs Fail in Back-and-Forth Chats Beautiful paper from @microsoft and @SFResearch Large Language Models (LLMs) are incredibly good at tackling tasks when you give them all the information upfront in one go. Think of asking for a code snippet with all requirements clearly…
🔥 xGen-Small – open-source small LMs with long context! 📏 Context: 128K-token documents 🧮 Math: 95.3 % GSM8K · 91.6 % MATH · 50 % AIME’24 💻 Code: 50.6 % LiveCodeBench 🔗 Models: huggingface.co/Salesforce/xge… 📄 Paper: arxiv.org/abs/2505.06496 📝 Blog: salesforce.com/blog/xgen-smal…
Thanks for sharing our work @omarsar0 ... and doing a better job than me at explaining our findings to a broad audience ^^'
LLMs Get Lost in Multi-turn Conversation The cat is out of the bag. Pay attention, devs. This is one of the most common issues when building with LLMs today. Glad there is now a paper to share insights. Here are my notes: