Valentina Pyatkin
@valentina__py
Postdoc at the Allen Institute for AI @allen_ai and @uwnlp
Heading to Vienna tomorrow for #ACL2025! Monday: Will Sheffield will be presenting his work on just the sneaky little discourse particle called JUST 😁 Just stop by Hall X4/X5 6-7:30pm CEST! Paper aclanthology.org/2025.findings-… w/ @kanishkamisra @valentina__py Ashwini Deo @kmahowald
And I'm giving a talk at the @l2m2_workshop on Distributional Memorization, next Friday! Curious what that's all about? Make sure to attend the workshop!
I'll be at #ACL2025 next week! Catch me at the poster sessions, eating Sachertorte and schnitzel, and speaking about distributional memorization at the @l2m2_workshop
why is vancouver sushi so good? 🤤 (vancouver food in general actually)
This week is #ICML in Vancouver, and a number of our researchers are participating. Here's the full list of Ai2's conference engagements—we look forward to connecting with fellow attendees. 👋
The bottleneck in AI isn't just compute - it's access to diverse, high-quality data, much of which is locked away due to privacy, legal, or competitive concerns. What if there was a way to train better models collaboratively, without actually sharing your data? Introducing…
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
In Geneva🇨🇭to attend the International Open-Source LLM Builders Summit and present OLMo and Tülu!
This new benchmark created by @valentina__py should be the new default replacing IFEval. Some of the best frontier models get <50% and it comes with separate training prompts so people don’t effectively train on test. Wild gap from o3 to Gemini 2.5 pro of like 30 points.
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵