Marzena Karpinska
@mar_kar_
NLP evaluation of long-form input/output, MT/multilingual NLP, creative text generation 🇵🇱 ➯ 🇯🇵 ➯ 🇺🇸 Former: Postdoc @UMASS_NLP
Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that ace needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of the 11 tested LLMs reaches human performance (97%). The best, #GPT-4o, gets only 55.8%.

A bit late to announce, but I’m excited to share that I'll be starting as an assistant professor at the University of Maryland @umdcs this August. I'll be recruiting PhD students this upcoming cycle for fall 2026. (And if you're a UMD grad student, sign up for my fall seminar!)
Don't sleep on this opportunity! This is an amazing advisor :)
Life update: I’m excited to share that I’ll be starting as faculty at the Max Planck Institute for Software Systems (@mpi_sws_) this Fall! 🎉 I’ll be recruiting PhD students in the upcoming cycle, as well as research interns throughout the year: lasharavichander.github.io/contact.html
ChatGPT Agent is a huge step up on BearCubs, especially on multimodal/interactive tasks (e.g., playing web games)! It gets 65.8% accuracy vs. Deep Research's 36% and Operator's 23%. Humans are at ~85%, and clearly better/faster at fine control & complex filtering.
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web! ✅ Humans achieve 85% accuracy ❌ OpenAI Operator: 24% ❌ Anthropic Computer Use: 14% ❌ Convergence AI Proxy: 13%
🎉 Excited to be at #ICML2025 in 🇨🇦! 📍 Catch me Saturday at the #Memorization workshop, I’ll be presenting OWL! arxiv.org/abs/2505.22945
Excited to talk about long-context models / eval at this panel on Saturday! I'm also looking for a postdoc / PhD students to work on related topics, happy to chat with anyone interested at #ICML2025!
💡 Curious about long-context foundation models (LCFM)? 🧠 We’re hosting a panel at the LCFM workshop at #ICML2025 on “How to evaluate long-context foundation models?” — We’d love to feature your question! Anything on long-context evaluation or modeling — drop it below / DM me🎤
𝐖𝐡𝐚𝐭 𝐇𝐚𝐬 𝐁𝐞𝐞𝐧 𝐋𝐨𝐬𝐭 𝐖𝐢𝐭𝐡 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧? I'm happy to announce that the preprint release of my first project is online! Developed with the amazing support of @lasha_nlp and @anmarasovic (Full link below 👇)
Periodic reminder that we should not engage in the farce of meaningless benchmarks and bad evaluation. 🤷🏽‍♂️
Kimi-K2 just took top spot on both EQ-Bench3 and Creative Writing! Another win for open models. Incredible job @Kimi_Moonshot
The opportunity gap in AI is more striking than ever. We talk way too much about those receiving $100M or whatever for their jobs, but not enough about those asking for <$1k to present their work. For the 3rd year in a row, @ml_collective is raising funds to support @DeepIndaba attendees.
Human-Centered Evals: 1, Benchmarks: 0
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
ONERULER has been accepted to #COLM2025! I'm thrilled to be back for the second conference. This time, we explore how long-context tasks fare in multilingual settings. Excited to be in Montreal this October🍁 + Catch an early peek at the @gem_workshop at ACL!
Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers? We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all! Our analysis across 26 languages 🧵👇
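For readers unfamiliar with the setup: a needle-in-a-haystack (NIAH) probe hides a "needle" sentence at some depth inside long filler text and asks the model to retrieve it. The sketch below is purely illustrative (the function name, filler text, and prompt wording are all made up here, not ONERULER's actual code); ONERULER's twist is that the needle may be absent, so "none" can be the correct answer.

```python
def build_niah_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Hide `needle` at a relative depth (0.0 = start, 1.0 = end) inside
    n_filler copies of a filler sentence, then append a retrieval question.
    Pass needle="" to build the 'nonexistent needle' variant."""
    sentences = [filler] * n_filler
    if needle:
        sentences.insert(int(depth * len(sentences)), needle)
    haystack = " ".join(sentences)
    question = ("What is the secret number mentioned in the text above? "
                "If none is mentioned, answer 'none'.")
    return f"{haystack}\n\n{question}"

# Classic NIAH case: needle present at mid-depth.
prompt = build_niah_prompt(
    needle="The secret number is 7481.",
    filler="The sky was a pale shade of grey that morning.",
    n_filler=200,
    depth=0.5,
)
```

Sweeping `depth` and `n_filler` is what produces the familiar green heatmaps; the absent-needle case is the cheap addition that makes the test meaningfully harder.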
CLIPPER has been accepted to #COLM2025! In this work, we introduce a compression-based pipeline to generate synthetic data for long-context narrative reasoning tasks. Excited to be in Montreal this October🍁
⚠️ Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification. We present CLIPPER✂️, a compression-based pipeline that produces grounded instructions for ~$0.50 each, 34x cheaper than human annotation.
Lots of reasons to hate #GenAI and feel fatigue. My biggest source of disappointment is how #GenAI has ruined crowdsourcing platforms. Unless you hire experts, the human data you think you are getting is just AI. Human-centered / behavioral researchers, be careful.
🤔 We know what people are using LLMs for, but do we know how they collaborate with an LLM? 🔍 In a recent paper we answered this by analyzing multi-turn sessions from 21 million consumer Microsoft Copilot and WildChat interaction logs: arxiv.org/abs/2505.16023
🚀 Tower+: our latest model in the Tower family — sets a new standard for open-weight multilingual models! We show how to go beyond sentence-level translation, striking a balance between translation quality and general multilingual capabilities. 1/5 arxiv.org/pdf/2506.17080
Dear ACL community, We are seeking emergency reviewers for the May cycle. Please indicate your availability (ASAP) if you can help review extra papers urgently (by the 24th of June AOE). Many thanks!
Update on reviews for EMNLP, which were due on June 18th: as of today, only about 80% of reviews have been completed 😧 As a reminder, reviewers deemed "highly irresponsible" will not be able to commit their work to EMNLP, as per the updated ARR policy: aclrollingreview.org/incentives2025
We have openings for interns / RAs / postdocs / researchers in NLP×soccer ⚽️, NLP×linguistics, LLMs, and more — if you're interested, please get in touch. aistairc.github.io/plu/people.htm…