Zaid Khan
@codezakh
@uncnlp with @mohitban47 working on grounded reasoning + multimodal agents // currently @allen_ai formerly @neclabsamerica // bs+ms CompE @northeastern
What if we could transform advanced math problems into abstract programs that can generate endless, verifiable problem variants? Presenting EFAGen, which automatically transforms static advanced math problems into their corresponding executable functional abstractions (EFAs).…
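Roughly speaking, an EFA here is a small program that exposes a problem's free parameters, samples valid instantiations, and computes the verifiable ground-truth answer for each variant. A minimal illustrative sketch of that idea on a toy problem (the class and the method names `sample_params`, `render`, and `solve` are hypothetical, not EFAGen's actual interface):

```python
import random

class SumOfConsecutiveIntegersEFA:
    """Illustrative executable functional abstraction (EFA) of a toy problem:
    'What is the sum of the integers from a to b inclusive?'"""

    def sample_params(self, rng: random.Random) -> dict:
        # Sample a valid problem instance (the abstraction's free parameters).
        a = rng.randint(1, 50)
        b = rng.randint(a + 1, a + 100)
        return {"a": a, "b": b}

    def render(self, params: dict) -> str:
        # Turn the parameters back into a natural-language problem statement.
        return f"What is the sum of the integers from {params['a']} to {params['b']} inclusive?"

    def solve(self, params: dict) -> int:
        # Compute the verifiable ground-truth answer for this variant.
        a, b = params["a"], params["b"]
        return (a + b) * (b - a + 1) // 2

rng = random.Random(0)
efa = SumOfConsecutiveIntegersEFA()
for _ in range(3):
    params = efa.sample_params(rng)
    print(efa.render(params), "->", efa.solve(params))
```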

The MUGen workshop at #ICML2025 is happening now! Stop by for talks on adversarial ML, unlearning as rational belief revision, failure modes in unlearning, robust LLM unlearning, and the bright vs. dark side of forgetting in generative AI!
🚨Exciting @icmlconf workshop alert 🚨 We’re thrilled to announce the #ICML2025 Workshop on Machine Unlearning for Generative AI (MUGen)! ⚡Join us in Vancouver this July to dive into cutting-edge research on unlearning in generative AI—featuring an incredible lineup of…
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! 🚀 OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
Overdue job update -- I am now:
- A Visiting Scientist at @schmidtsciences, supporting AI safety and interpretability
- A Visiting Researcher at the Stanford NLP Group, working with @ChrisGPotts
I am so grateful I get to keep working in this fascinating and essential area, and…
I’ll be at #ICML2025 this week to present ScPO:
📌 Wednesday, July 16th, 11:00 AM-1:30 PM
📍 East Exhibition Hall A-B, E-2404
Stop by or reach out to chat about improving reasoning in LLMs, self-training, or just tips about being on the job market next cycle! 😃
🚨 Self-Consistency Preference Optimization (ScPO) 🚨
- New self-training method without human labels -- learn to make the model more consistent!
- Works well for reasoning tasks where RMs fail to evaluate correctness.
- Close to performance of supervised methods *without* labels,…
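As described in the thread, the core recipe is to sample several answers per question, prefer the most self-consistent (majority-vote) answer over a less consistent one, and train on the resulting preference pairs with no human labels. A toy sketch of just the pair-construction step; the function name and the DPO-style downstream loss are assumptions, not the paper's exact procedure:

```python
from collections import Counter

def build_scpo_pair(question: str, answers: list[str]) -> tuple[str, str] | None:
    """Build one (chosen, rejected) preference pair from sampled answers,
    preferring the most self-consistent (majority-vote) answer.
    Returns None when there is no disagreement to learn from."""
    counts = Counter(answers)
    if len(counts) < 2:
        return None  # all samples agree; no preference signal
    (chosen, _), *rest = counts.most_common()
    rejected = rest[-1][0]  # least consistent answer as the dispreferred one
    return chosen, rejected

# Example with pre-sampled answers (in practice these come from the model itself):
answers = ["42", "42", "41", "42", "37"]
pair = build_scpo_pair("What is 6*7?", answers)
print(pair)  # ('42', '37') -> feed into a DPO-style preference loss
```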
🥳 Excited to share that our work -- Retrieval-Augmented Generation with Conflicting Evidence -- on addressing conflict in RAG due to ambiguity, misinformation, and noisy/irrelevant evidence has been accepted to @COLM_conf #COLM2025! Our new benchmark RAMDocs proves challenging for…
🚨Real-world retrieval is messy: queries can be ambiguous, or documents may conflict/have incorrect/irrelevant info. How can we jointly address all these problems? We introduce:
➡️ RAMDocs, a challenging dataset with ambiguity, misinformation, and noise.
➡️ MADAM-RAG, a…
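A rough sketch of how such a multi-agent setup could look: one agent answers from each retrieved document in isolation, and an aggregator reconciles the per-document answers over a few rounds, surfacing ambiguity and discounting unreliable sources. The `llm` call, prompts, and round structure below are hypothetical placeholders, not the MADAM-RAG implementation:

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your chat API of choice."""
    raise NotImplementedError

def multi_agent_rag_answer(query: str, documents: list[str], rounds: int = 2) -> str:
    """Sketch of multi-agent debate over (possibly conflicting) retrieved documents:
    one agent per document, plus an aggregator that reconciles their answers."""
    summary = ""
    for _ in range(rounds):
        # Each agent answers the query using only its own document,
        # optionally seeing the aggregator's previous summary.
        agent_answers = [
            llm(f"Document:\n{doc}\n\nPrevious summary: {summary}\n"
                f"Answer the question using only this document, or say it is "
                f"irrelevant/unreliable.\nQuestion: {query}")
            for doc in documents
        ]
        # The aggregator reconciles conflicting answers and filters noise.
        summary = llm("Per-document answers:\n" +
                      "\n".join(f"- {a}" for a in agent_answers) +
                      f"\n\nReconcile these into an answer to: {query}. "
                      "If the question is ambiguous, list each valid answer with its support.")
    return summary
```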
🚨Introducing Video-RTS: Resource-Efficient RL for Video Reasoning with Adaptive Video TTS! While RL-based video reasoning with LLMs has advanced, the reliance on large-scale SFT with extensive video data and long CoT annotations remains a major bottleneck. Video-RTS tackles…
🎉 Excited to share that TaCQ (Task-Circuit Quantization), our work on knowledge-informed mixed-precision quantization, has been accepted to #COLM2025 @COLM_conf! Happy to see that TaCQ was recognized with high scores and a nice shoutout from the AC – big thanks to @EliasEskin…
🚨Announcing TaCQ 🚨 a new mixed-precision quantization method that identifies critical weights to preserve. We integrate key ideas from circuit discovery, model editing, and input attribution to improve low-bit quant., w/ 96% 16-bit acc. at 3.1 avg bits (~6x compression)…
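In spirit, the recipe is: score how much each weight matters for the target task, keep the highest-scoring weights at 16-bit, and quantize everything else to very low precision. A toy per-tensor PyTorch sketch of that idea follows; the saliency score here is a simple stand-in, whereas TaCQ derives it from circuit-discovery and attribution-style signals:

```python
import torch

def mixed_precision_quantize(weight: torch.Tensor,
                             saliency: torch.Tensor,
                             keep_frac: float = 0.02,
                             bits: int = 3) -> torch.Tensor:
    """Toy mixed-precision quantization: preserve the top `keep_frac` most
    salient weights at full precision and round the rest to `bits` bits
    (symmetric uniform quantization). `saliency` is a per-weight importance
    score; this sketch simply takes it as given."""
    k = max(1, int(keep_frac * weight.numel()))
    keep_idx = saliency.flatten().topk(k).indices
    keep_mask = torch.zeros(weight.numel(), dtype=torch.bool, device=weight.device)
    keep_mask[keep_idx] = True
    keep_mask = keep_mask.view_as(weight)

    # Symmetric uniform quantization of the non-salient weights.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.round(weight / scale).clamp(-qmax - 1, qmax) * scale

    return torch.where(keep_mask, weight, quantized)

w = torch.randn(256, 256)
s = w.abs()  # stand-in saliency; the real method uses task-conditioned scores
w_q = mixed_precision_quantize(w, s)
print((w_q - w).abs().mean())
```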
I've officially joined Meta Superintelligence Labs (MSL) org in the Bay Area. I'll be working on critical aspects of pre-training, synthetic data and RL for the next generation of models. Humbled and eager to contribute to the quest for superintelligence. @AIatMeta
🎉 Very excited to see TaCQ — our work on task-conditioned mixed-precision quantization that draws on interpretability methods — accepted to @COLM_conf #COLM2025 with strong scores and a nice shoutout from the AC! Kudos to Hanqi on leading this effort!
🥳Our work UTGen & UTDebug on teaching LLMs to generate effective unit tests & improve code debugging/generation has been accepted to @COLM_conf #COLM2025! Stay tuned for more exciting results -- e.g., using 32B-scale UTGen models to improve debugging with frontier models like…
🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨 which introduces ✨UTGen and UTDebug✨ for teaching LLMs to generate unit tests (UTs) and debugging code from generated tests. UTGen+UTDebug improve LLM-based code debugging by addressing 3 key…
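The loop, as described: a model generates unit tests (inputs plus expected outputs) for the task, the candidate code is executed against them, and failing tests with their error messages are fed back to the model to repair the code over a few rounds. A schematic sketch under those assumptions (`generate_tests` and `repair_code` are hypothetical LLM-backed helpers, not UTGen/UTDebug's API):

```python
def generate_tests(task: str) -> list[tuple[tuple, object]]:
    """Hypothetical: ask an LLM (e.g., a UTGen-style model) for (args, expected) pairs."""
    raise NotImplementedError

def repair_code(task: str, code: str, failures: list[str]) -> str:
    """Hypothetical: ask an LLM to fix `code` given failing-test feedback."""
    raise NotImplementedError

def debug_loop(task: str, code: str, fn_name: str, max_rounds: int = 3) -> str:
    """Sketch of test-driven debugging: run generated unit tests and
    iteratively repair the code until they pass (or we give up)."""
    tests = generate_tests(task)
    for _ in range(max_rounds):
        namespace: dict = {}
        exec(code, namespace)                      # define the candidate function
        fn = namespace[fn_name]
        failures = []
        for args, expected in tests:
            try:
                got = fn(*args)
                if got != expected:
                    failures.append(f"{fn_name}{args} returned {got!r}, expected {expected!r}")
            except Exception as e:                 # crashing tests are feedback too
                failures.append(f"{fn_name}{args} raised {e!r}")
        if not failures:
            return code                            # all generated tests pass
        code = repair_code(task, code, failures)
    return code
```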
🎉 Yay, welcome @hyunji_amy_lee -- super excited to have you join us as a postdoc! 🤗 Welcome to our MURGe-Lab + @unc_ai_group + @unccs family & the beautiful Research Triangle area -- looking forward to the many fun+impactful collaborations together 🔥
🥳Excited to share that I’ll be joining @unccs as a postdoc this fall. Looking forward to working with @mohitban47 & amazing students at @unc_ai_group. I'll continue working on retrieval, aligning knowledge modules with LLMs' parametric knowledge, and expanding to various modalities.
🎉 Excited to share that CAPTURe has been accepted to #ICCV2025! CAPTURe is a new benchmark for VLM reasoning that requires completing patterns to count objects that are occluded from view. We find that SOTA VLMs struggle with both counting and reasoning about partial patterns!…
Check out 🚨CAPTURe🚨 -- a new benchmark and task testing spatial reasoning by making VLMs count objects under occlusion. Key Takeaways: ➡️ SOTA VLMs (GPT-4o, Qwen2-VL, Intern-VL2) have high error rates on CAPTURe (but humans get very low error ✅) and models struggle to reason…
🥳 Excited to share that I’ll be joining the CS Department at UNC-Chapel Hill (@unccs @unc_ai_group) as an Assistant Professor starting Fall 2026! Before that, I’ll be working at Ai2 Prior (@allen_ai @Ai2Prior) and UW (@uwcse) on multimodal understanding and generation.
🎉 Yay, welcome to the @unc @unccs @unc_ai_group family and beautiful Research Triangle area, Jason! Looking forward to the many exciting collaborations on these topics! 🔥 PS. If you are applying for fall 2026 PhD admissions, make sure to apply to new faculty member Jason 👇
🚀 Excited to introduce a new member of the LRM (Large Reconstruction Models) family — 4D-LRM!
1. What is 4D-LRM? It’s a large-scale space-time model that reconstructs a dynamic object from any few views at any time to any view at any other time.
2. What does it do? 🔁 Learn…
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from…
🎉Excited to announce VEGGIE has been accepted to #ICCV2025! VEGGIE is a unified MLLM + Diffusion framework for instructional video editing. It presents a systematic approach spanning data, model, benchmark, and evaluation design, and shows strong multi-skill editing +…
🚨 Introducing VEGGIE 🥦—a unified, end-to-end, and versatile instructional video generative model. Current video editing methods struggle with:
1. Understanding direct user instructions
2. Handling diverse editing skills in one model
3. Balancing multiple training…
🚨 Excited to announce MF2, a new+challenging long-video understanding dataset! MF2 covers open-license movies and focuses on key events/arcs/causal chains in the film. While people can answer MF2 questions easily, even the strongest models like Gemini 2.5 Pro struggle with it!…
🚨Meet MF²: Movie Facts & Fibs: a new benchmark for long-movie understanding! 🤔Do you think your model understands movies? Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗 — things humans recall easily, but even top models like…
New paper Alert 🚨 Introducing MEXA: A general and training-free multimodal reasoning framework via dynamic multi-expert skill selection, aggregation and deep reasoning! MEXA: 1. Selects task- and modality-relevant experts based on the query and various required multimodal…
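A rough reading of that pipeline: pick the expert models relevant to the query's modalities and skills, collect their textual outputs, then let a strong reasoning LLM aggregate them into the final answer, with no extra training. The expert registry, prompts, and `llm` call in the sketch below are hypothetical placeholders, not MEXA's actual components:

```python
from typing import Callable

def llm(prompt: str) -> str:
    """Hypothetical reasoning-LLM call."""
    raise NotImplementedError

# Hypothetical expert registry: name -> (description, callable returning text evidence)
EXPERTS: dict[str, tuple[str, Callable[[dict], str]]] = {
    "ocr": ("reads text in images", lambda x: "..."),
    "audio_captioner": ("describes audio", lambda x: "..."),
    "object_detector": ("lists objects and positions in images", lambda x: "..."),
}

def multi_expert_answer(query: str, inputs: dict) -> str:
    """Training-free multi-expert pipeline: select experts, aggregate outputs, reason."""
    # 1) Select task- and modality-relevant experts from their descriptions.
    menu = "\n".join(f"- {name}: {desc}" for name, (desc, _) in EXPERTS.items())
    selected = llm(f"Question: {query}\nAvailable experts:\n{menu}\n"
                   "List the expert names needed, comma-separated.").split(",")
    selected = [s.strip() for s in selected if s.strip() in EXPERTS]

    # 2) Run the selected experts to collect textual evidence.
    evidence = {name: EXPERTS[name][1](inputs) for name in selected}

    # 3) Deep reasoning over the aggregated expert outputs.
    report = "\n".join(f"[{name}] {out}" for name, out in evidence.items())
    return llm(f"Expert observations:\n{report}\n\nAnswer step by step: {query}")
```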
We evaluated more than 1000 reasoning LLMs on 12 reasoning-focused benchmarks and made fascinating observations about cross-benchmark comparisons. You can explore all that data yourself on our HuggingFace spaces page. (1/4)