Zikui Cai
@ZikuiCai
Postdoc @UofMaryland @umiacs
Introducing MORSE-500 🌐 morse-500.github.io 500 scripted videos that stress-test six reasoning skills — beyond math, beyond static pics, built to get harder. Key Features: 🚀 Fresh & Portable 🎯 Diverse Categories 👁️ Pure Visual Cues 📈 Scalable Difficulty Dive in 🧵

I value a model that visually explains its thinking for easy, precise understanding! Excited to share our new Zebra-CoT dataset, enabling VLMs to reason with images + text. Explore: arxiv.org/abs/2507.16746 #AI #MachineLearning #CoT #VisualReasoning #MultimodalReasoning #VLMs
🚨Announcing Zebra-CoT, a large-scale dataset of high-quality interleaved image-text reasoning traces 📜. Humans often draw visual aids like diagrams when solving problems, but existing VLMs reason mostly in pure text. 1/n
Thanks for sharing! Building and optimizing agentic systems with DSPy has been a great experience!
AegisLLM leverages DSPy's MIPROv2 optimizer in a totally unexpected way: to evolve its prompts based on the attacks it sees in real time. Some really large gains!
Thanks for sharing our work!
If you are interested in building agentic workflows, AegisLLM is a nice instantiation in the safety/security domain! Thanks @furongh for sharing it with me. Agentic workflows must be designed and optimized as systems, as @lateinteraction keeps repeating.
🐭🔒 LLM security is a cat-and-mouse game. Attackers adapt. Prompts mutate. Meanwhile, most defenses? 🚫 Static. Fragile. One-shot fixes. It’s time for something smarter. ⚔️ Meet AegisLLM: An agentic runtime defense that thinks, reacts, and learns — just like the attackers do.…
There’s been heated debate lately: Can generative AI truly self-improve? ✅Some say yes, pointing to models learning like curious humans. ❌Others say no, invoking the first law of thermodynamics: You can’t get something from nothing. No new info, no gain. 🧠 But what if the…
Excited to speak at the Workshop on Computer Vision in the Wild @CVPR 2025! 🎥🌍 🗓️ June 11 | 📍 Room 101 B, Music City Center, Nashville, TN 🎸 🧠 Talk: From Perception to Action: Building World Models for Generalist Agents Let’s connect if you're around! #CVPR2025 #robotics…
Introducing ARGUS 👁️ A benchmark for measuring hallucinations and omissions in free-form captions generated by Video-LLMs.
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning "We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary…