Ai2
@allen_ai
Breakthrough AI to solve the world's biggest problems. › Join us: https://allenai.org/careers › Newsletter: https://tinyurl.com/3vc2r2m8
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
issues w preference LM benchmarks
🐡 data contains cases where the "bad" response is just as good as chosen one
🐟 model rankings can feel off (claude ranks lower than expected)
led by @cmalaviya11 (TACL 2025), we study underspecified queries & detrimental effect on model evals
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
Excited to share what I have been focusing on this year! Inference-time search to optimize Bayesian surprise pushes us towards long-horizon discovery! Introducing "AutoDS": Autonomous Discovery via Surprisal. "It can not only find the diamond in the rough, but also can rule out…
Great science starts with great questions. 🤔✨ Meet AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking. 🧵
A new model enters SciArena. 👀 Welcome Moonshot AI's Kimi K2! SciArena lets you benchmark models across scientific literature tasks, applying a crowdsourced LLM evaluation approach to the scientific domain. 🧪 Learn more and try SciArena here: sciarena.allen.ai

You can now jump from Scholar QA answers to highlighted evidence in the source paper's PDF :)
We’ve upgraded ScholarQA, our agent that helps researchers conduct literature reviews efficiently by providing detailed answers. Now, when ScholarQA cites a source, it won’t just tell you which paper it came from–you’ll see the exact quote, highlighted in the original PDF. 🧵