Omar Shaikh
@oshaikh13
CS Ph.D. student @Stanford - previously @GeorgiaTech
What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵
🤖 Evaluating Human-AI Systems? Time to raise the bar. Check out SPHERE: An Evaluation Card for Human-AI Systems at ACL 2025 poster! 🗓️ July 28 18:00 📍 Hall X4/X5 🔗 sphere-eval.github.io Let’s talk transparent, rigorous, and human-centric evaluation! #ACL2025NLP #humanai
bro is cheating on this too💀
in the past week at @cluely, we've been kicking off our most ambitious project ever. the models of today are great at answering questions. the models at @cluely will be really good at predicting which questions you have. this is a fundamentally different user experience than…
new paper 🌟 interpretation of uncertainty expressions like "i think" differs cross-linguistically. we show that (1) llms are sensitive to these differences but (2) humans overrely on their outputs across languages
User simulators bridge RL with real-world interaction // jessylin.com/2025/07/10/use… How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve…
If you're attending #ICML2025, check out our 💭 Agent Workflow Memory for online adaptive agents: Jul 17 4:30-7pm @ West Hall 🔎 RAGGED for designing scalable and stable RAG systems: Jul 16 11:00-13:30 @ East Hall Computer Use Agent Workshop on Jul 19 🌐 "Universal Retrieval for…
Job seekers are using LLMs to boost their resumes. Are companies interviewing the best candidates ... or just the candidates using the best LLM? 🧐 Our new ICML paper presents a fair and accurate hiring algorithm under stochastic manipulations: 📄 arxiv.org/abs/2502.13221 🧵 1/5
Can you tell what actions are being mimed in this video? If so, you’re smarter than AI models! Check the last tweet in this thread for answers. In a new paper, we present MIME, which evaluates whether vision language models (VLMs) have a robust understanding of human actions. 🧵
Thank you to everyone for your energy and enthusiasm in joining this adventure with me so far!
individual reporting for post-deployment evals — a little manifesto (& new preprints!) tldr: end users have unique insights about how deployed systems are failing; we should figure out how to translate their experiences into formal evaluations of those systems.
Men are much more likely to self-promote their papers on Twitter/X than women
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
seems big AI labs are hyperfixating on reasoning when they should focus on *memory* instead normal people won't use models that can think for hours to solve hard math problems people want models that learn over time, remember details, adapt and interact like a person would
Verrrrry intriguing-looking and labor-intensive test of whether LLMs can come up with good scientific ideas. After implementing those ideas, the verdict seems to be "no, not really."
New paper: What if neural networks assessed similarity like humans? We introduce Tversky Neural Networks, based on Tversky's (1977) psychological theory of similarity. These models enable efficient, interpretable, and psychologically plausible deep learning. (1/8)
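For readers unfamiliar with the underlying theory: Tversky's (1977) contrast model scores similarity from shared vs. distinctive features, with asymmetric weights. Below is a minimal sketch of the classic Tversky similarity index over feature sets — not the paper's network architecture, just the psychological measure it builds on:

```python
def tversky_similarity(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky's (1977) set-theoretic similarity: shared features count for,
    distinctive features count against, with asymmetric weights alpha/beta."""
    common = len(a & b)   # features shared by both objects
    a_only = len(a - b)   # features distinctive to a
    b_only = len(b - a)   # features distinctive to b
    return common / (common + alpha * a_only + beta * b_only)

# With alpha != beta the measure is asymmetric, sim(a, b) != sim(b, a),
# which matches human judgments where similarity ratings depend on
# which object is the referent.
print(tversky_similarity({"wings", "beak", "flies"}, {"wings", "beak", "sings"}))
```

The paper presumably parameterizes something like this with learned feature representations; the set-based version above is just the textbook formulation.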
AI companions aren’t science fiction anymore 🤖💬❤️ Thousands are turning to AI chatbots for emotional connection – finding comfort, sharing secrets, and even falling in love. But as AI companionship grows, the line between real and artificial relationships blurs. 📰 “Can A.I.…
Evaluating policies on a real robot can be painful. Can we use a world model to get a rough estimate of how good a policy is? Check out "Evaluating Robot Policies in a World Model". Paper: arxiv.org/abs/2506.00613 Demo: world-model-eval.github.io Code: github.com/world-model-ev…
new multi-turn instruction grounding dataset with @wp_mccarthy and @saujasv - multi-modal instruction : drawing + txt - verifiable execution : 2D CAD gym env - easy eval : API → score - baselines : human vs VLMs - large : 15,163 inst-exe rounds github.com/AutodeskAILab/… [1/n]
A bit late to announce, but I’m excited to share that I'll be starting as an assistant professor at the University of Maryland @umdcs this August. I'll be recruiting PhD students this upcoming cycle for fall 2026. (And if you're a UMD grad student, sign up for my fall seminar!)
This is THE paper to share with friends and family who want a realistic perspective on how AI will affect their careers. Banger from @EchoShao8899!!!
🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody asks them what they want. While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce.🧵
this time, by none other than MIPRO's co-creator @michaelryan207 himself