Melanie Sclar
@melaniesclar
PhD student @uwnlp @uwcse | Visiting Researcher @AIatMeta FAIR | Prev. Lead ML Engineer @asapp, intern @LTIatCMU | 🇦🇷
Did you know that depending on the format used in few-shot prompting, you may get accuracies ranging from 4% to 88% for a given task w/ LLaMA-2-70B 5-shot? Or 47%-85% w/ GPT-3.5? 🤯 We explore this variance in FormatSpread, or: How I learned to start worrying about prompt formatting. 1/n
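For intuition, here's a minimal sketch (my illustration, not the paper's code) of the kind of surface-format variation FormatSpread sweeps over: the prompts below are semantically identical, yet accuracy can swing dramatically across them.

```python
# Enumerate equivalent few-shot prompt formats that differ only in surface
# choices (separator, spacing, label casing) -- the axes of variation behind
# the accuracy spread described above.
from itertools import product

demos = [("great movie!", "positive"), ("so boring...", "negative")]

separators = [": ", " - ", ":\n"]   # field-separator variants
joiners = ["\n", "\n\n"]            # spacing between demonstrations
casings = [str.lower, str.upper]    # label-casing variants

def render(sep, joiner, case):
    return joiner.join(f"Input{sep}{x}{joiner}Label{sep}{case(y)}"
                       for x, y in demos)

variants = [render(*combo)
            for combo in product(separators, joiners, casings)]
print(len(variants), "equivalent prompts that differ only in formatting")
# Scoring a model on the task under each variant and reporting the spread
# of accuracies is the FormatSpread-style measurement.
```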

Check out our work on preference modeling through latent (& interpretable) attribute representation learning! PrefPalette allows you to understand _why_ something is preferred and _how_ preference varies depending on context 🎨
WHY do you prefer something over another? Reward models treat preference as a black box 😶🌫️ but human brains 🧠 decompose decisions into hidden attributes. We built the first system to mirror how people really make decisions in our #COLM2025 paper 🎨 PrefPalette ✨ Why it matters 👉🏻 🧵
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (avg. length: 51 minutes).
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo 💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt in/out your data…
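One plausible way to picture the inference-time opt-out (a hedged sketch of the generic MoE mechanism, not FlexOlmo's actual implementation): mask the router logits of experts trained on opted-out data before selecting the top-k.

```python
# Generic mixture-of-experts routing with expert opt-out; expert ids and
# shapes are illustrative assumptions, not FlexOlmo's real configuration.
import torch

def route(router_logits, opted_out, top_k=2):
    """router_logits: (batch, n_experts); opted_out: expert ids to exclude."""
    logits = router_logits.clone()
    for e in opted_out:
        logits[:, e] = float("-inf")          # excluded expert can never fire
    weights = torch.softmax(logits, dim=-1)   # mass shifts to remaining experts
    topw, topi = weights.topk(top_k, dim=-1)  # pick top-k active experts
    return topw / topw.sum(-1, keepdim=True), topi

w, idx = route(torch.randn(1, 8), opted_out={3, 5})
print(idx, w)  # selected experts exclude 3 and 5; weights renormalized to 1
```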
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
Thrilled to announce that I will be joining @UTAustin @UTCompSci as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…
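To make "random" vs. "incorrect" vs. "ground-truth" rewards concrete, here's an illustrative sketch (function names are mine, not the paper's) of the three verifier signals being compared:

```python
# Three reward signals for RLVR on a sampled solution; only the last one
# actually consults correctness.
import random

def ground_truth_reward(pred, gold):
    return 1.0 if pred.strip() == gold.strip() else 0.0

def incorrect_reward(pred, gold):
    return 1.0 - ground_truth_reward(pred, gold)  # rewards only wrong answers

def random_reward(pred, gold):
    return float(random.random() < 0.5)           # ignores correctness entirely

# The surprise in the thread: on Qwen2.5-Math-7B, even the two "spurious"
# signals recover most of the MATH-500 gain that ground-truth rewards give.
```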
Dear NYC friends: just got here and will be around until Thu!
📢I'm thrilled to announce that I’ll be joining @KAIST_AI as an Assistant Professor in 2026, leading the Computation & Cognition (COCO) Lab🤖🧠: coco-kaist.github.io We'll be exploring reasoning, learning w/ synthetic data, and social agents! +I'm spending a gap year @nvidia✨
Excited to announce our workshop on Visions of Language Modeling at COLM'25! 🔥 We felt that current LM research focuses too narrowly on a few popular topics (e.g., test-time scaling and LLM agents), and we'd love to bring some entropy back 💪 To do this, we invited a…
We begin our speaker spotlights with Alane Suhr (@alsuhr), Assistant Professor at UC Berkeley and an invited speaker at the Workshop on Computer Use Agents at @icmlconf 2025! Her research centers on building systems that use language to interact with people, enabling agents to…
Still around at #NAACL2025? I will be presenting a poster for the work 👇 at the Workshop on Narrative Understanding in Tesuque, Albuquerque Convention Center, from 2:30 pm. Please stop by if interested. Here is the poster, designed by the amazing @advaitmb.
📢 New Paper! Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning about plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎 W/ @melaniesclar and @tsvetshop 1/n
Now @abertsch72 is talking about in-context learning with long-context models! arxiv.org/abs/2405.00200
With the rise of R1, search seems out of fashion? We prove the opposite! 😎 Introducing Retro-Search 🌈: an MCTS-inspired search algorithm that RETROspectively revises R1’s reasoning traces to synthesize new, untaken reasoning paths that are better 💡 yet shorter ⚡️.
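As the tweet describes it, the core move is retrospective revision; a rough sketch of that loop (my paraphrase, not the actual Retro-Search implementation):

```python
# Branch off each prefix of an existing reasoning trace, roll out an untaken
# continuation, and keep the revision if it is still correct but shorter.
def retro_search(trace, continue_fn, is_correct):
    """trace: list of reasoning steps; continue_fn(prefix) -> alternative
    completion (list of steps); is_correct(steps) -> bool (e.g., checks
    the final answer)."""
    best = trace
    for i in range(len(trace)):
        candidate = trace[:i] + continue_fn(trace[:i])  # untaken path at step i
        if is_correct(candidate) and len(candidate) < len(best):
            best = candidate                            # shorter, still correct
    return best
```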
Will be in Berkeley for the weekend, and off to #ICLR2025 in Singapore on Monday night to present CreativityIndex and ExploreToM! Please reach out if you'd like to meet: these days I'm most excited about reliable synthetic data generation for reasoning in ¬(math & code) domains
See our work on procedurally generating challenging reasoning problems for detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)