Gabriel Stanovsky
@GabiStanovsky
Assistant Professor at @CseHuji
Check out @niveckhaus's excellent work: a model capable of playing with human players in asynchronous settings, deciding when to intervene and when to stay quiet 🤐
🚨 New Paper: "Time to Talk"! 🕵️ We built an LLM agent that doesn't just decide WHAT to say, but also WHEN to say it! Introducing "Time to Talk" - LLM agents for asynchronous group communication, tested in real Mafia games with human players. 🌐niveck.github.io/Time-to-Talk 🧵1/7
Ever wondered how Transformers refine their top-k predictions over their layers? 📊 Is there an order to the madness? Come find out at my poster presentation tomorrow at @icmlconf 📍East Exhibition Hall E-2512, 11:00-13:30
🚨New paper alert🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc
🕊️ DOVE is a living benchmark! Just pushed major updates: 📊 Dataset expansion: Added ~5700 MMLU examples with Llama-70B - each tested across 100 different prompt variations = 570K new predictions! 📈 Website upgrades: New interactive plots throughout: slab-nlp.github.io/DOVE/
Care about LLM evaluation? 🤖 🤔 We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs on different prompts, domains, tokens, models... Join our community effort to expand it with YOUR model predictions & become a co-author!
We built and released the #LLMafia Dataset 🕵️‍♂️ 🎲 21 games 💬 2558 messages 🤖 211 messages from the LLM agent 🤗 Available on HuggingFace: huggingface.co/datasets/nivec… In the image: a real sample from our dataset 🧵5/7
🎉 Our paper DOVE 🕊️ has been accepted to #ACL2025 Findings! DOVE 🕊️ is a massive collection (250M!) of LLM outputs across different prompts, domains, and models, aimed at democratizing LLM evaluation research! Thanks to all collaborators! Paper: slab-nlp.github.io/DOVE/
Over the jet lag but missing #NAACL2025 and the famous gazebo - best time for highlights! 1. @radamihalcea's "long tail of the world" metaphor really stuck with me: most of us are from small, often-overlooked cultures. Many papers in the special track try to bridge this gap
Had an awesome time presenting both my talk and poster @naaclmeeting! Will miss having beer at the Sister pub 🍻 🎤 arxiv.org/abs/2409.16646 📌 arxiv.org/abs/2406.13274
Accepted at #icml2025🥳 Camera ready version (with newer models like Llama-3 and Qwen-Audio) coming soon!
📢Paper release📢 What computation is the Transformer performing in the layers after the top-1 becomes fixed (a so-called "saturation event")? We show that the next highest-ranked tokens also undergo saturation *in order* of their ranking. Preprint: arxiv.org/abs/2410.20210 1/4
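A minimal sketch of what "saturation in order of ranking" means in practice (toy logits and a logit-lens-style readout of my own construction, not the paper's code): for each rank r, find the earliest layer after which the rank-r token never changes again, and check that lower ranks saturate earlier.

```python
import numpy as np

def saturation_layers(layer_logits: np.ndarray, k: int) -> list[int]:
    """For each rank r < k, return the earliest layer after which the
    rank-r token (under a per-layer readout) stays fixed to the end."""
    # layer_logits: (num_layers, vocab_size) logits for one position
    ranked = np.argsort(-layer_logits, axis=1)[:, :k]  # token id at each rank, per layer
    num_layers = ranked.shape[0]
    sat = []
    for r in range(k):
        final_tok = ranked[-1, r]
        layer = num_layers - 1
        # walk backwards while the rank-r token already equals its final value
        while layer > 0 and ranked[layer - 1, r] == final_tok:
            layer -= 1
        sat.append(layer)
    return sat

# Toy example: the top-1 token fixes at layer 1, the rank-2 token only at layer 2
logits = np.array([
    [0.1, 0.9, 0.5, 0.2],   # layer 0: top-2 = [1, 2]
    [0.9, 0.1, 0.5, 0.2],   # layer 1: top-2 = [0, 2]
    [0.9, 0.6, 0.5, 0.2],   # layer 2: top-2 = [0, 1]
    [0.9, 0.7, 0.5, 0.2],   # layer 3: top-2 = [0, 1]
])
print(saturation_layers(logits, k=2))  # → [1, 2]: rank 0 saturates before rank 1
```

On real models the per-layer readout would come from projecting intermediate hidden states through the unembedding matrix; the detection logic stays the same.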
If you're in @naaclmeeting and interested in cross-cultural research (like everyone else here...) come see my talk today. Ruidoso room 17:00, see you there :)
Have you ever wondered if speakers of different languages focus on different entities when viewing the same image? Check our recent work to find out! arxiv.org/abs/2409.16646 w/ @PontiEdoardo
We're at #NAACL2025! Presenting: 📍Cross-Lingual and Cross-Cultural Variation in Image Descriptions Thu May 1, 5:00 PM Ruidoso 📍The State and Fate of Summarization Datasets: A Survey Fri May 2, 12:00 PM Ruidoso @uriberger88, @Shachar_Don, @Dahan_Noam
Only three more days to submit your evaluation papers to our ACL workshop!
Are you recovering from your @COLM_conf abstract submission? Did you know that GEM has a non-archival track that allows you to submit a two-page abstract in parallel? Our workshop deadline is coming up, please consider submitting your evaluation paper!
"Summarize this text" out ❌ "Provide a 50-word summary, explaining it to a 5-year-old" in ✅ The way we use LLMs has changed—user instructions are now longer, more nuanced, and packed with constraints. Interested in how LLMs keep up? 🤔 Check out WildIFEval, our new benchmark!
Can RAG performance get * worse * with more relevant documents?📄 We put the number of retrieved documents in RAG to the test! 💥Preprint💥: arxiv.org/abs/2503.04388 1/3
In-context learning assumes access to annotated datasets, but in new domains we often label data ourselves with a limited budget. Given raw samples, how should we select demonstration samples for labeling? Read our paper: arxiv.org/abs/2406.13274 w/ @GabiStanovsky @talbaumel
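One simple baseline for this selection problem (a sketch under my own assumptions — not necessarily the paper's method) is greedy farthest-point sampling over sample embeddings, so the few examples you can afford to annotate cover the input space rather than cluster together:

```python
import numpy as np

def select_for_labeling(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedy farthest-point sampling: pick `budget` diverse samples to annotate."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    chosen = [int(rng.integers(n))]                       # arbitrary first pick
    # distance from every sample to its nearest already-chosen sample
    dist = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dist))                        # farthest from current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Two tight clusters: a budget of 2 should pick one sample from each
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
picks = select_for_labeling(X, budget=2)
print(sorted(X[picks][:, 0]))  # one point near 0, one near 5
```

The selected indices are the raw samples you would hand to annotators; the resulting labeled pairs then serve as in-context demonstrations.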
🚨 Just Out Can LLMs extract experimental data about themselves from scientific literature to improve understanding of their behavior? We propose a semi-automated approach for large-scale, continuously updatable meta-analysis to uncover intriguing behaviors in frontier LLMs. 🧵
🚨New arXiv preprint!🚨 LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯 In our latest work with @Itay_itzhak_, @FazlBarez, @GabiStanovsky, and @boknilev, we challenge assumptions about hallucination origin!