Philippe Laban
@PhilippeLaban
Research Scientist @MSFTResearch. NLP/HCI Research.
Maybe don't use an LLM for _everything_? Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries
Two issues w/ preference LM benchmarks:
🐡 data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (Claude ranks lower than expected)
Led by @cmalaviya11 (TACL 2025), we study underspecified queries & their detrimental effect on model evals
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
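The core idea above can be sketched in a few lines: judge a pair of responses to an underspecified query both with and without the missing context, and check whether the judge's preference flips. This is an illustrative toy, not the paper's actual pipeline; `toy_judge` and all strings below are made up for demonstration.

```python
# Hedged sketch of "contextualized evaluation": compare judge verdicts on an
# underspecified query before and after the missing context is supplied.
# judge_fn stands in for a real LLM judge; all names here are illustrative.

def judge(query, resp_a, resp_b, context=None, judge_fn=None):
    """Build the judge prompt (optionally with context) and return 'A' or 'B'."""
    prompt = query if context is None else f"{query}\nContext: {context}"
    return judge_fn(prompt, resp_a, resp_b)

# Toy judge: with no context it prefers the longer answer; with context it
# prefers the answer mentioning the first context keyword.
def toy_judge(prompt, a, b):
    if "Context:" in prompt:
        key = prompt.rsplit("Context:", 1)[1].split()[0].lower()
        return "A" if key in a.lower() else "B"
    return "A" if len(a) > len(b) else "B"

q = "Recommend a laptop."
a = "A powerful 16-inch workstation with a discrete GPU and lots of RAM."
b = "A light, cheap ultrabook for travel."
no_ctx = judge(q, a, b, judge_fn=toy_judge)
with_ctx = judge(q, a, b, context="travel budget", judge_fn=toy_judge)
```

Here the verdict flips from A to B once the context is added, which is exactly the mechanism by which added context can reorder a leaderboard.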
🏆 #ICML2025 Best Paper Award: AI Safety Should Prioritize the Future of Work 📄 Paper: arxiv.org/abs/2504.13959 🎉 Congratulations to Sanchaita Hazra @hsanchaita, Bodhisattwa Prasad Majumder @mbodhisattwa, and Tuhin Chakrabarty @TuhinChakr for winning the Outstanding Award —…
New paper from my group at @MSFTResearch! 📄arxiv.org/abs/2507.07935 Promises about how AI will change work are cheap. What does the actual data say? We measured which work activities people use AI for, how successful they are, and which jobs do those tasks. 🧵1/8
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! 📜 arxiv.org/pdf/2504.15219
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent finds 👩🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇
Now accepted to @COLM_conf 🤩 Super excited for Montreal 🇨🇦🍁 This also marks my third successful collaboration with my good friend @PhilippeLaban
Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
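One common way to use a reward model at test time, as the post describes, is best-of-n reranking: sample several drafts and keep the one the RM scores highest. A minimal sketch, assuming a generic `score_fn`; the `toy_reward` heuristic below is a placeholder for the trained reward model, not the paper's method.

```python
# Minimal sketch of test-time reward-model use via best-of-n reranking.
# score_fn stands in for a trained writing reward model; names are illustrative.

def best_of_n(candidates, score_fn):
    """Return the candidate draft the reward model scores highest."""
    return max(candidates, key=score_fn)

# Toy reward: penalize length and a few stereotypical "slop" words.
def toy_reward(text):
    slop_words = {"delve", "tapestry", "furthermore"}
    penalty = sum(text.lower().count(w) for w in slop_words)
    return -len(text.split()) - 10 * penalty

drafts = [
    "Furthermore, we delve into a rich tapestry of ideas.",
    "We test three ideas.",
]
best = best_of_n(drafts, toy_reward)  # the terse, slop-free draft wins
```

Swapping `toy_reward` for a learned RM is the only change needed to turn this sketch into a real reranker.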
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
Do people actually like human-like LLMs? In our #ACL2025 paper HumT DumT, we find a kind of uncanny valley effect: users dislike LLM outputs that are *too human-like*. We thus develop methods to reduce human-likeness without sacrificing performance.
🚨 The Business AI Plot Thickens 🚨 CRMArena set the stage for business AI evaluation in realistic environments. Now we're back with CRMArena-Pro - a major expansion that extends to 19 work tasks across diverse business applications (sales, service, and CPQ processes). It covers…
"Tell, Don't Show" was accepted to #ACL2025 Findings! Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box ✨📚 arxiv.org/abs/2505.23166
Human-AI conversations are multi-turn, lengthy and complex; something that existing LLM judges do not explicitly accommodate!👩🤖 Excited to share our new preprint on evaluating and improving preference judgment for such conversations!🙌🏽 📑: arxiv.org/pdf/2505.20451 🧵👇 [1/n]
What are NLP papers really saying about the purpose and use of their models/datasets? 🤔 Who are they for? What problems do they solve? How are they used? We built a framework + tool to: (1) analyze framing trends across papers (2) help authors reflect on their own framing 🧵
We are eagerly looking for ACs for #chi2026 Computational Interaction--one of the most exciting & fastest growing SCs! If you have CHI publication & review experiences on related topics and would like to contribute, please DM or email me!
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts! ✍🏻Entirely human-written questions by 13 CS researchers 👀Emphasis on visual reasoning – hard to verbalize via text CoTs 📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B
Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.…
Vision Language Models (VLMs) are great at many things, but they often fumble when it comes to simple visual arithmetics like counting or comparing lengths, hindering their understanding of charts 📈 and geometry 📐. Our new paper explores why this happens 🧐 and discovers the…
Do LLMs struggle in long, multi-turn conversations? Yes, they do: performance degrades in multi-turn conversations due to an increase in unreliability. A new paper found a drop of 39% in multi-turn settings, showing LLMs make premature assumptions and struggle to recover from…
New paper! 🧵1/3 LLMs earn high marks with fully-specified, single-turn prompts, yet real users reveal needs over time. We built "sharded simulation," which reveals information progressively. Performance plummets from 90% ➡️ 65%, even in 2-turn chats, on ALL 15 LLMs we tested.
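The sharded-simulation setup described above can be sketched as follows: split a fully specified instruction into information shards and reveal one shard per user turn instead of all at once. This is a hedged toy version; the naive sentence split and `toy_assistant` are illustrative stand-ins, not the paper's actual sharding procedure or models.

```python
# Hedged sketch of "sharded" conversation simulation: a fully specified
# instruction is split into shards that are revealed one turn at a time.

def shard_instruction(full_instruction: str) -> list[str]:
    """Split a fully specified prompt into atomic information shards.
    Naive sentence split for illustration; real sharding is more careful."""
    return [s.strip() for s in full_instruction.split(".") if s.strip()]

def simulate_sharded_conversation(shards, assistant_fn):
    """Reveal one shard per user turn and collect the assistant's replies."""
    history = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        reply = assistant_fn(history)  # any chat-model callable works here
        history.append({"role": "assistant", "content": reply})
    return history

# Toy assistant that just reports how many user turns it has seen so far.
def toy_assistant(history):
    n_user = sum(1 for m in history if m["role"] == "user")
    return f"ack turn {n_user}"

shards = shard_instruction(
    "Write a Python function. It must sort a list. Ties keep input order."
)
convo = simulate_sharded_conversation(shards, toy_assistant)
```

Comparing model outputs on the sharded conversation against the single-turn fully specified prompt is what exposes the multi-turn performance drop.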
Why LLMs Fail in Back-and-Forth Chats Beautiful paper from @microsoft and @SFResearch Large Language Models (LLMs) are incredibly good at tackling tasks when you give them all the information upfront in one go. Think of asking for a code snippet with all requirements clearly…
🔥 xGen-Small – open-source small LMs with long context! 📏 Context: 128K-token documents 🧮 Math: 95.3 % GSM8K · 91.6 % MATH · 50 % AIME’24 💻 Code: 50.6 % LiveCodeBench 🔗 Models: huggingface.co/Salesforce/xge… 📄 Paper: arxiv.org/abs/2505.06496 📝 Blog: salesforce.com/blog/xgen-smal…
Thanks for sharing our work @omarsar0 ... and doing a better job than me at explaining our findings to a broad audience ^^'
LLMs Get Lost in Multi-turn Conversation The cat is out of the bag. Pay attention, devs. This is one of the most common issues when building with LLMs today. Glad there is now a paper to share insights. Here are my notes: