Shivalika Singh
@singhshiviii
Research Engineer @Cohere_Labs @cohere | @huggingface fellow 🤗 | “Research means that you don't know, but are willing to find out” ✨
LMArena is widely used for model evaluation, but is it measuring true progress? 🔮 In our work, "The Leaderboard Illusion", we reveal: 🔒 Private testing 📊 Data access asymmetries ⚠️ Overfitting risks 🚫 Silent deprecations Despite best intentions, arena policies favor a few!

Can we update model behavior immediately based on granular feedback from users? I think this is part of an important big-picture direction: it moves user feedback from a cumbersome ask to adaptable, in-place learning. Work led by @RivardLuke and @yuntiandeng 🔥✨
🚀 Introducing Chat Annotator: a free chatbot where users can highlight parts of responses, leave a comment, and have the model incorporate that feedback into its next output. Powered by Cohere Command-A. 👉 Try it here: chatannotator.com
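For the curious, here is a minimal sketch of how span-level feedback could be folded into the next model turn using the Cohere Python SDK. This is my generic reading, not the actual Chat Annotator implementation; the model id and the feedback prompt format are assumptions.

```python
# Minimal sketch, NOT the actual Chat Annotator implementation:
# fold a highlighted span + user comment back into the next model turn.
import cohere

co = cohere.ClientV2()  # reads the CO_API_KEY environment variable

def revise(history, last_reply, highlight, comment):
    """Ask the model to rewrite its last reply, honoring span-level feedback."""
    feedback = (
        f'In your previous reply, I highlighted: "{highlight}"\n'
        f'My comment on that span: "{comment}"\n'
        "Please rewrite the reply, incorporating this feedback."
    )
    messages = history + [
        {"role": "assistant", "content": last_reply},
        {"role": "user", "content": feedback},
    ]
    # "command-a-03-2025" is my assumption for the Command A model id.
    resp = co.chat(model="command-a-03-2025", messages=messages)
    return resp.message.content[0].text
```

Because the feedback rides along as an ordinary conversation turn, no retraining is involved; the adaptation happens purely in context.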
I’m very excited to be co-organizing this @NeurIPSConf workshop on LLM evaluations! Evaluating LLMs is a complex and evolving challenge. With this workshop, we hope to bring together diverse perspectives to make real progress. See the details below:
We are happy to announce our @NeurIPSConf workshop on LLM evaluations! Mastering LLM evaluation is no longer optional -- it's fundamental to building reliable models. We'll tackle the field's most pressing evaluation challenges. For details: sites.google.com/corp/view/llm-…. 1/3
@cohere packing only the essential items for @aclmeeting 2025
There’s less than one week to go until @aclmeeting in Vienna, Austria! 🇦🇹 The Cohere Labs and @Cohere research teams are looking forward to showcasing some of our latest research and connecting with the community. Be sure to stop by our booth and say hello!
We have an incredible roster of accepted papers at @aclmeeting 2025. I will be there, as will many of our senior and engineering staff @mziizm @beyzaermis @mrdanieldsouza @singhshiviii 🔥 Looking forward to catching up with everyone.
Sometimes it is important to take a moment and celebrate -- we achieved all of this in 3 years. Pretty incredible impact from @Cohere_Labs 🔥
More about the project here, from first author @singhshiviii: x.com/singhshiviii/s…
I joined this project a year ago with just the hope of getting a glimpse of AI research. I had no idea it would end up becoming such a special collaboration of 3000+ people from all over the world! Looking back, I could not have asked for a better introduction to research :)
This is one of my favorite sections in the Aya dataset paper. It is towards the end of the paper, so probably isn't read often. It speaks to how the eventual breakthrough was completely intertwined with the geo-reality experienced by independent researchers around the world.
🚨 New Recipe just dropped! 🚨 "LLMonade 🍋" ➡️ squeeze max performance from your multilingual LLMs at inference time! 👀🔥 🧑‍🍳 @ammar__khairi shows you how to 1️⃣ Harvest your Lemons 🍋🍋🍋🍋🍋 2️⃣ Pick the Best One 🍋
🚀 Want better LLM performance without extra training or special reward models? Happy to share my work with @Cohere_labs : "When Life Gives You Samples: Benefits of Scaling Inference Compute for Multilingual LLMs" 👀How we squeeze more from less at inference 🍋, details in 🧵
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
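If you want to try the general recipe behind the 🍋 thread yourself, here is a minimal best-of-n sketch. It is my generic reading, not the paper's exact method; the model id is an assumption and the judge is a placeholder.

```python
# Generic best-of-n sampling at inference, NOT the paper's exact recipe:
# draw n candidates at nonzero temperature, then keep the top-scoring one.
import cohere

co = cohere.ClientV2()

def pick_best(candidates):
    # Stand-in judge: swap in an LLM judge, reward model, or voting scheme.
    return max(candidates, key=len)

def best_of_n(prompt, n=5):
    candidates = []
    for _ in range(n):
        resp = co.chat(
            model="command-a-03-2025",  # model id is an assumption
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # encourages diversity across samples
        )
        candidates.append(resp.message.content[0].text)
    return pick_best(candidates)
```

The point of the recipe is that extra inference compute substitutes for extra training: the quality of the final answer depends only on how many samples you draw and how well you select among them.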
In just 3 years, we’ve published 95 papers through Cohere Labs — with contributions and collaboration from over 60 institutions. These papers span core ML research topics and reflect what’s possible when researchers come together to explore the unknown.
There are a limited number of spaces left for our first physical event in London on 14th July. Act quickly if you'd like to attend! We have Tim Nguyen (@IAmTimNguyen) from DeepMind, Max Bartolo (@max_nlp) from Cohere, and Enzo Blindow (VP of Data, Research & Analytics) from…
Let's get studious. 🏫 This July, join the Cohere Labs Open Science Community for ML Summer School. You'll be part of a global community exploring the future of ML and hear from speakers across the industry. Register to be first to hear about the line-up & connect with others.
Worth reading this research, which showed LMArena has already been turned into cheat-slop and that Meta was one of the worst culprits for gaming it x.com/singhshiviii/s…
🚨 Wait, adding simple markers 📌during training unlocks outsized gains at inference time?! 🤔 🚨 Thrilled to share our latest work at @Cohere_Labs: “Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers“ that explores this phenomenon! Details in 🧵 ⤵️
Can we train models for better inference-time control instead of over-complex prompt engineering❓ Turns out the key is in the data — adding fine-grained markers boosts performance and enables flexible control at inference🎁 Huge congrats to @mrdanieldsouza for this great work
🤹 How do we move away from complicated and brittle prompt engineering at inference for under-represented tasks?🤔 🧠 Our latest work finds that optimizing training protocols improves controllability and boosts performance on underrepresented use cases at inference time 📈
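For intuition, here is a toy sketch of the marker idea as I read it: tag training examples with metadata so the same tags work as control codes at inference. The tag format and fields below are hypothetical, not the paper's.

```python
# Toy illustration of training-time markers; tag format is hypothetical.
raw_dataset = [
    {"prompt": "Translate: Habari yako?", "completion": "How are you?",
     "language": "sw", "domain": "chat"},
]

def add_markers(example):
    # Prepend metadata tags so they can act as control codes at inference.
    prefix = f"<lang={example['language']}> <domain={example['domain']}> "
    return {"prompt": prefix + example["prompt"],
            "completion": example["completion"]}

train_set = [add_markers(ex) for ex in raw_dataset]

# At inference, set the markers directly instead of prompt engineering:
query = "<lang=sw> <domain=legal> Summarize the contract below: ..."
```

Because the markers are learned during training, steering the model at inference becomes a matter of setting a tag rather than crafting an elaborate prompt.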
Thanks @_akhaliq for the spotlight on our work! I believe strongly in this wider direction: taking the pressure off everyday users to be master prompt engineers and inferring controllability directly from tasks.
Cohere presents Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
🚨New pretraining paper on multilingual tokenizers 🚨 Super excited to share my work with @Cohere_Labs: One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
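The core idea, as a rough sketch with the Hugging Face tokenizers library: train one shared subword vocabulary across many languages at once. The corpus files and vocab size below are placeholders, not the paper's setup.

```python
# Sketch of one shared multilingual tokenizer (placeholder files/params,
# not the paper's exact configuration).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=250_000,  # assumption, not the paper's number
    special_tokens=["<unk>", "<pad>"],
)
# One tokenizer trained jointly over corpora from many languages.
tokenizer.train(["en.txt", "hi.txt", "sw.txt", "ar.txt"], trainer)
tokenizer.save("one_tokenizer.json")
```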