Eliya Habba @ ACL 2025 🇦🇹
@EliyaHabba
PhD student at @HebrewU #NLP
Care about LLM evaluation? 🤔🤔 We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs on different prompts, domains, tokens, models... Join our community effort to expand it with YOUR model predictions & become a co-author!
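For a feel of what querying a DOVE-style dump could look like, here's a minimal Python sketch. The record fields (model, domain, prompt_template, output) are illustrative assumptions, not DOVE's actual schema:

```python
import json
from collections import Counter

# Hypothetical record layout for a DOVE-style dump; the real schema may differ.
records = [
    {"model": "llm-a", "domain": "math", "prompt_template": "t1", "output": "42"},
    {"model": "llm-a", "domain": "math", "prompt_template": "t2", "output": "41"},
    {"model": "llm-b", "domain": "law",  "prompt_template": "t1", "output": "guilty"},
]

def query(records, **filters):
    """Return all records matching the given field=value filters."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

# How sensitive is llm-a to the prompt template on math questions?
math_a = query(records, model="llm-a", domain="math")
print(Counter(r["output"] for r in math_a))  # Counter({'42': 1, '41': 1})
```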
🚨 New paper alert 🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases – but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025 🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc
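As a concrete (hypothetical) illustration of what a cognitive-bias probe can look like, here's a toy paired-prompt check for the classic framing effect. `ask_model` is a placeholder stub, and this sketches the general idea only, not the paper's actual protocol:

```python
# Toy paired-prompt bias probe (illustrative, not the paper's method).

def ask_model(prompt: str) -> str:
    return "A"  # stub: a real model API call would go here

# Logically equivalent framings of the same decision (gain vs. loss framing).
pairs = [
    ("A treatment saves 200 of 600 patients. Choose it? (A) yes (B) no",
     "A treatment lets 400 of 600 patients die. Choose it? (A) yes (B) no"),
]

for gain_frame, loss_frame in pairs:
    a, b = ask_model(gain_frame), ask_model(loss_frame)
    print("consistent" if a == b else f"framing flip: {a} vs {b}")
```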
Technical practitioners & grads – join us to build an LLM evaluation hub! Infra goals: 🔧 Share evaluation outputs & params 🔍 Query results across experiments. Perfect for 🧰 hands-on folks ready to build tools the whole community can use. Join the EvalEval Coalition here 👇
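One way such a hub could make results shareable and queryable is a common record type. A minimal sketch, assuming illustrative field names (not an agreed spec):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalRecord:
    """One shared evaluation result; field names are illustrative, not a fixed spec."""
    model: str
    benchmark: str
    sample_id: str
    prompt: str
    output: str
    params: dict   # decoding params: temperature, max_tokens, ...
    score: float

rec = EvalRecord("llm-a", "mmlu", "q-017", "Q: ...", "B", {"temperature": 0.0}, 1.0)
print(json.dumps(asdict(rec)))  # JSONL-friendly: easy to pool and query across experiments
```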
🚨 New paper! We present CHIMERA: a KB of 28K+ scientific idea recombinations 💡 It captures how researchers blend concepts or take inspiration across fields, enabling: 1. Meta-science 2. Training models to predict new combos noy-sternlicht.github.io/CHIMERA-Web 📄 Findings & data:
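To make "idea recombination" concrete, here's a hypothetical record layout for one recombination edge and a tiny query over it; the released KB's actual schema may differ:

```python
# Illustrative structure for idea-recombination edges; the real schema may differ.
edges = [
    {"source_concept": "diffusion models", "source_field": "vision",
     "target_concept": "protein design", "target_field": "biology",
     "relation": "inspiration"},
    {"source_concept": "attention", "source_field": "NLP",
     "target_concept": "weather forecasting", "target_field": "climate",
     "relation": "blend"},
]

# Which fields borrow ideas from other fields? (a toy meta-science query)
cross = [(e["source_field"], e["target_field"]) for e in edges
         if e["source_field"] != e["target_field"]]
print(cross)  # [('vision', 'biology'), ('NLP', 'climate')]
```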
The longer a reasoning LLM thinks, the more likely it is to be correct, right? Apparently not. Presenting our paper: "Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning". Link: arxiv.org/abs/2505.17813 1/n
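One simple way to act on this finding is to sample several chains and vote among the shortest ones. A minimal sketch of that selection rule (a simplification for illustration, not the paper's exact method):

```python
from collections import Counter

def pick_answer(chains, m=3):
    """Majority-vote the answers of the m shortest reasoning chains.

    `chains` is a list of (reasoning_text, final_answer) pairs sampled from a
    model. Preferring shorter chains follows the paper's observation, but this
    exact rule is a simplified sketch.
    """
    shortest = sorted(chains, key=lambda c: len(c[0]))[:m]
    return Counter(ans for _, ans in shortest).most_common(1)[0][0]

chains = [("step1 step2", "7"), ("very long winding derivation ...", "9"),
          ("quick check", "7"), ("medium length reasoning here", "7")]
print(pick_answer(chains))  # '7'
```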
I'm excited to share that our latest research, "Toward Reliable Proof Generation with LLMs: Leveraging Analogical Guidance and Symbolic Verification", is now available on arXiv: arxiv.org/pdf/2505.14479 w/ @StrnYtn @HyadataLab
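The generate-then-verify pattern can be illustrated with a toy symbolic check via sympy: propose candidates, keep only what the verifier accepts. The candidates below stand in for LLM proposals; real proof verification is of course far richer:

```python
import sympy as sp

x = sp.symbols("x")

def verify(lhs, rhs) -> bool:
    """Symbolic check: does lhs == rhs hold for all x?"""
    return sp.simplify(lhs - rhs) == 0

# Stand-ins for LLM proposals: candidate rewrites of (x + 1)**2.
candidates = [x**2 + 1, x**2 + 2*x + 1]

target = (x + 1)**2
for cand in candidates:
    if verify(cand, target):
        print("verified:", cand)   # x**2 + 2*x + 1
        break
    print("rejected:", cand)       # x**2 + 1
```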
🎉 Our paper DOVE 🕊️ has been accepted to #ACL2025 Findings! DOVE 🕊️ is a massive collection (250M!) of LLM outputs across different prompts, domains, and models, aimed at democratizing LLM evaluation research! Thanks to all collaborators! Paper: slab-nlp.github.io/DOVE/
"Summarize this text" out โ "Provide a 50-word summary, explaining it to a 5-year-old" in โ The way we use LLMs has changedโuser instructions are now longer, more nuanced, and packed with constraints. Interested in how LLMs keep up? ๐ค Check out WildIFEval, our new benchmark!
🌍 AI is changing the world. Is AI regulation on the right track? 🤔 While regulators rely on benchmarking 📊, we show why it cannot guarantee AI behavior: arxiv.org/pdf/2501.15693 Excited about this multidisciplinary collaboration! @GabiStanovsky, @RKeydar, @GadiPerl
There's a lot of talk about regulating AI, but do regulators know the technology well enough? In our new paper, we survey major reg efforts & find they rely on benchmarking, which we know to be problematic. How did this happen & what can we do about it? arxiv.org/pdf/2501.15693
If you're at #NeurIPS2024, don't miss @nitzanguetta's poster. There are some really FUN #VisualRiddles by @EliyaHabba. Not there? Check out the project's GitHub!
🚨 Happening NOW at #NeurIPS2024 with @nitzanguetta! 🎭 #VisualRiddles: A Commonsense and World Knowledge Challenge for Vision-Language Models. 📍 East Ballroom C, Creative AI Track 🔗 visual-riddles.github.io
Look at the CRAZY domain gap we found in summarization datasets: while English resources are diverse, other languages are mostly restricted to news. Presenting our survey covering 130+ datasets in 100+ languages! Explore: github.com/edahanoam/Awes… @GabiStanovsky, @nlphuji 1/6
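Measuring such a domain gap boils down to grouping dataset metadata by language. A minimal sketch with made-up rows (the repo tracks the real 130+ datasets):

```python
from collections import defaultdict

# Illustrative metadata rows; the survey's repo catalogs the real datasets.
datasets = [
    {"name": "ds1", "language": "en", "domain": "news"},
    {"name": "ds2", "language": "en", "domain": "science"},
    {"name": "ds3", "language": "fr", "domain": "news"},
]

domains = defaultdict(set)
for d in datasets:
    domains[d["language"]].add(d["domain"])

# Domain diversity per language: the gap the survey highlights.
print({lang: sorted(doms) for lang, doms in domains.items()})
# {'en': ['news', 'science'], 'fr': ['news']}
```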