Haidar Khan
@haidarkk1
Research Scientist (Currently @Meta, previously @SDAIA_SA, @Amazon). PhD CS @rpi. Scale smarter, not harder. Opinions generated from an independent LLM.
Here is some news I'm happy to share before the @iclr_conf FOMO really starts to set in 😢. We have been playing with the idea of using games as LLM evals (pun intended) for a while now and it's finally ready! ZeroSumEval is a scalable evaluation methodology that pits models…

Acquiring knowledge is easy, the hard part is knowing what to apply and when. That’s why all true learning is “on the job.” Life is lived in the arena.
I don't get why so many people are dunking on Meta. Building frontier AI models is freaking hard & they have consistently released 10 times more openly than others, which is a game-changer for the field. What about all the other big tech & startups with much more AI resources…
So I'm not just inside the DSPy bubble? seeing one DSPy post after another...
guys we haven’t even released 3.0 yet
I’ve been to many wild places in the world but the beauty of the American wilderness is second to none. (Picture taken with Ray Ban Meta)

Food for thought for those building benchmarks and leaderboards...
It is critical for scientific integrity that we trust our measure of progress. The @lmarena_ai has become the go-to evaluation for AI progress. Our release today demonstrates the difficulty in maintaining fair evaluations on @lmarena_ai, despite best intentions.
What a blast! Farewell @iclr_conf and Singapore. Thanks for the great hospitality @sbmaruf !

Check out our new Arabic Safety leaderboard on huggingface!
🚀 The @aiastrolabe Arabic Safety Index (ASAS - أساس) is now live. We are officially launching the first-ever benchmark focused on Arabic LLM safety, and the results demand attention.
Our CRAG-MM Challenge (KDD Cup 2025) invites you to develop innovative multi-modal, multi-turn question-answering systems with a focus on RAG, using agentic tools to retrieve information. The goal is to improve visual reasoning: aicrowd.com/challenges/met…
I’ll be in Singapore for #ICLR2025 - would love to meet! DM me and let’s arrange something!
Our evals are showing even frontier models are woefully lacking on Arabic. It’s embarrassingly simple to elicit offensive content from leading models like GPT-4o by just talking to the model in Arabic. Regionally developed models are hardly better… At AI Astrolabe we are…
Arabic AI is being left behind. We are here to change that. At @aiastrolabe, we’re building the gateway to #ArabicAI; pushing the frontier across dialects, domains, and capabilities, and making sure models are safe, culturally grounded, and built for our communities.