Mohaiminul (Emon) Islam (on job market)
@mmiemon
𝐎𝐧 𝐭𝐡𝐞 𝐈𝐧𝐝𝐮𝐬𝐭𝐫𝐲 𝐉𝐨𝐛 𝐌𝐚𝐫𝐤𝐞𝐭 | PhD Student @unccs | 2x Research Intern @MetaAI. Computer Vision, Video Understanding, Multimodal LLMs, AI Agents.
🚀 On the job market! Final-year PhD @ UNC Chapel Hill working on computer vision, video understanding, multimodal LLMs & AI agents. 2x Research Scientist Intern @Meta 🔍 Seeking Research Scientist/Engineer roles! 🔗 md-mohaiminul.github.io 📧 mmiemon [at] cs [dot] unc [dot] edu
Check out our new paper: Video-RTS 🎥 A data-efficient RL method for complex video reasoning tasks. 🔹 Pure RL w/ output-based rewards. 🔹 Novel sparse-to-dense Test-Time Scaling (TTS) to expand input frames via self-consistency. 💥 96.4% less training data! More in the thread👇
🚨Introducing Video-RTS: Resource-Efficient RL for Video Reasoning with Adaptive Video TTS! While RL-based video reasoning with LLMs has advanced, the reliance on large-scale SFT with extensive video data and long CoT annotations remains a major bottleneck. Video-RTS tackles…
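Not the paper's code, but a minimal sketch of the sparse-to-dense TTS idea described above, assuming a hypothetical `answer_fn(frame_ids, question)` call into a video LLM: sample a few answers on sparsely sampled frames, and only expand the frame budget when the sampled answers fail to agree (self-consistency).

```python
from collections import Counter
from typing import Callable, List, Sequence

def sparse_to_dense_tts(
    answer_fn: Callable[[List[int], str], str],   # hypothetical video-LLM call: (frame_ids, question) -> answer
    num_total_frames: int,
    question: str,
    budgets: Sequence[int] = (8, 16, 32, 64),     # frame budgets, sparse to dense (assumed values)
    num_samples: int = 5,                         # answers sampled per budget for self-consistency
    agreement_threshold: float = 0.6,             # fraction of samples that must agree to stop early
) -> str:
    """Expand the input-frame budget only when sampled answers disagree."""
    majority_answer = ""
    for budget in budgets:
        # Uniformly sample `budget` frame indices from the full video.
        step = max(num_total_frames // budget, 1)
        frame_ids = list(range(0, num_total_frames, step))[:budget]

        # Draw several stochastic answers and check self-consistency (majority agreement).
        answers = [answer_fn(frame_ids, question) for _ in range(num_samples)]
        majority_answer, count = Counter(answers).most_common(1)[0]
        if count / num_samples >= agreement_threshold:
            return majority_answer  # answers agree; no need for denser frames

    return majority_answer  # fall back to the densest budget's majority vote
```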
Great to see our paper ReVisionLLM featured by MCML blog! @gberta227 #CVPR2025
🚀 Check out our latest work, ReVisionLLM, now featured on the MCML blog! 🔍 A Vision-Language Model for accurate temporal grounding in hour-long videos. 👉 mcml.ai/news/2025-06-2… #VisionLanguage #MultimodalAI #MCML #CVPR2025
Had a great time presenting BIMBA at #CVPR2025 today! Engaging discussions, thoughtful questions, and lots of interest in our work on long-range VideoQA 🔍🎥 📝 Paper: arxiv.org/abs/2503.09590 🌐 Project: sites.google.com/view/bimba-mllm 🎥 Demo: youtu.be/YIU2XypsT-o
🚀New #CVPR2025 Paper🚀 Introducing BIMBA, an efficient multimodal LLM for long-range video QA💡 It sets SOTA on 7 VQA benchmarks by intelligently selecting key spatiotemporal tokens utilizing the selective scan mechanism of Mamba models. 🧵Thread below👇 arxiv.org/pdf/2503.09590
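A rough, illustrative sketch (not BIMBA's actual implementation) of the general token-compression idea: route a long spatiotemporal token sequence through a selective-scan-style layer and keep only a small set of learnable query tokens for the LLM. The GRU below merely stands in for a real Mamba selective-scan block; all shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Illustrative only: compress frames*patches spatiotemporal tokens into a few query tokens.

    `self.scan` stands in for a Mamba-style selective-scan layer; a GRU is used here
    purely so the sketch runs without the Mamba CUDA kernels.
    """
    def __init__(self, dim: int = 256, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learnable query tokens
        self.scan = nn.GRU(dim, dim, batch_first=True)  # placeholder for a selective-scan block
        self.proj = nn.Linear(dim, dim)                 # map compressed tokens into the LLM's input space

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim) with N = frames * patches, usually many thousands of tokens
        b = video_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Append the queries after the video tokens so the scan can route salient
        # spatiotemporal information into the query positions.
        seq = torch.cat([video_tokens, queries], dim=1)
        scanned, _ = self.scan(seq)
        compressed = scanned[:, -queries.size(1):]      # keep only the query positions
        return self.proj(compressed)                    # (B, num_queries, dim), fed to the LLM

# Example: 8 frames x 64 patches = 512 tokens compressed to 64 query tokens.
tokens = torch.randn(2, 8 * 64, 256)
print(TokenCompressor()(tokens).shape)  # torch.Size([2, 64, 256])
```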
Come to our poster today at #CVPR2025! 🗓️ June 15 | 🕓 4–6PM 📍 Poster #282 | ExHall D 📝 Paper: arxiv.org/abs/2503.09590 🌐 Project: sites.google.com/view/bimba-mllm 💻 Code: github.com/md-mohaiminul/… 🎥 Youtube: youtu.be/YIU2XypsT-o
Great to see a lot of interest in ReVisionLLM from the video understanding community! If you missed it, check out arxiv.org/abs/2411.14901 @hannan_tanveer
Presenting ReVisionLLM at #CVPR2025 today! Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos If you are at CVPR, please stop by 📍 Poster #307, Session 4 🗓️ June 14, 5–7PM | ExHall D 🔗 arxiv.org/pdf/2411.14901 @hannan_tanveer @gberta227
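Not the authors' code; a loose sketch of what recursive temporal grounding in an hour-long video can look like, assuming a hypothetical `score_fn(start, end, query)` that asks a vision-language model how relevant a segment is: coarsely score long chunks first, then recurse only into the most promising ones until segments are short enough to localize.

```python
from typing import Callable, List, Tuple

def recursive_grounding(
    score_fn: Callable[[float, float, str], float],  # hypothetical VLM relevance score for (start, end, query)
    start: float,
    end: float,
    query: str,
    min_len: float = 10.0,   # stop recursing below this segment length in seconds (assumed value)
    branch: int = 8,         # sub-segments per level (assumed value)
    keep_top: int = 2,       # how many sub-segments to recurse into
) -> List[Tuple[float, float, float]]:
    """Return (start, end, score) candidates found by coarse-to-fine recursion."""
    length = end - start
    if length <= min_len:
        return [(start, end, score_fn(start, end, query))]

    # Split the current span into equal sub-segments and score each coarsely.
    step = length / branch
    segments = [(start + i * step, start + (i + 1) * step) for i in range(branch)]
    ranked = sorted(segments, key=lambda seg: score_fn(seg[0], seg[1], query), reverse=True)

    # Recurse only into the highest-scoring sub-segments.
    results: List[Tuple[float, float, float]] = []
    for s, e in ranked[:keep_top]:
        results.extend(recursive_grounding(score_fn, s, e, query, min_len, branch, keep_top))
    return sorted(results, key=lambda r: r[2], reverse=True)
```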
Another great accomplishment by Emon at this #CVPR2025. Interestingly, rather than using some complex ensemble model, Emon won the EgoSchema challenge by simply applying his latest BIMBA model, which he will also present at the poster session on Sunday, 4–6PM. Be sure to stop by!
🚀 Excited to share that we won 1st place at the EgoSchema Challenge at EgoVis, #CVPR2025! Our method (81%) outperformed human accuracy (76.2%) for the first time on this challenging task 🎯 Stop by #CVPR: 📍 Poster #282 | June 15, 4–6PM | ExHall D 🔗 sites.google.com/view/bimba-mllm
Very proud of this great accomplishment! Congrats @mmiemon! Well deserved!
Excited to share that our paper Video ReCap (#CVPR2024) won the EgoVis Distinguished Paper Award at #CVPR2025! Honored to see our work recognized and its impact on the video understanding community. Huge thanks to my co-authors and my advisor @gberta227 🔗 sites.google.com/view/vidrecap
If you are at #CVPR2025, check out our Transformers for Vision (T4V) workshop!
@CVPR is around the corner!! Join us at the Workshop on T4V at #CVPR2025 with a great speaker lineup (@MikeShou1, @jw2yang4ai, @WenhuChen, @roeiherzig, Yuheng Li, Kristen Grauman) covering diverse topics! Website: sites.google.com/view/t4v-cvpr2… #CVPR #Transformer #Vision #T4V2025 #T4V
Had a great time presenting at the GenAI session @CiscoMeraki—thanks @nahidalam for the invite🙏 Catch us at #CVPR2025: 📌 BIMBA: arxiv.org/abs/2503.09590 (June 15, 4–6PM, Poster #282) 📌 ReVisionLLM: arxiv.org/abs/2411.14901 (June 14, 5–7PM, Poster #307) @gberta227 @hannan_tanveer
Nice explanation of KV caching!
KV caching in LLMs, clearly explained (with visuals):
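The core idea, for anyone skimming: during autoregressive decoding, the keys and values of already-generated tokens never change, so they are computed once and cached; each new token computes only its own query/key/value and attends over the cached ones. A minimal single-head, unbatched sketch (illustrative shapes only):

```python
import torch
import torch.nn.functional as F

def attend_with_kv_cache(x_new, w_q, w_k, w_v, cache):
    """One decoding step of single-head attention with a KV cache.

    x_new: (1, d) embedding of the newly generated token.
    cache: dict holding keys/values of all previous tokens, each of shape (t, d).
    """
    q = x_new @ w_q                      # query/key/value for the new token only
    k = x_new @ w_k
    v = x_new @ w_v

    # Append the new key/value to the cache instead of recomputing the whole prefix.
    cache["k"] = torch.cat([cache["k"], k], dim=0)
    cache["v"] = torch.cat([cache["v"], v], dim=0)

    # The new token attends over every cached position (causal by construction).
    scores = (q @ cache["k"].T) / cache["k"].size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ cache["v"]    # (1, d) attention output

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):                                   # decode 5 tokens
    out = attend_with_kv_cache(torch.randn(1, d), w_q, w_k, w_v, cache)
print(cache["k"].shape)  # torch.Size([5, 16]): keys grow by one row per generated token
```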
🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with…
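As a rough illustration of what a spatio-temporal plan can contain (not the paper's actual format; all names here are assumed): per-frame bounding boxes for each object across time, which a training-free guidance method can then use to steer layout and trajectories.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]

@dataclass
class VideoSketch:
    """Illustrative spatio-temporal plan: one box per object per planned frame."""
    num_frames: int
    trajectories: Dict[str, List[Box]] = field(default_factory=dict)  # object name -> per-frame boxes

    def add_object(self, name: str, boxes: List[Box]) -> None:
        assert len(boxes) == self.num_frames, "one box per planned frame"
        self.trajectories[name] = boxes

# Example plan: a ball moving left-to-right across 4 planned frames.
sketch = VideoSketch(num_frames=4)
sketch.add_object("ball", [(0.1 + 0.2 * t, 0.4, 0.2 + 0.2 * t, 0.5) for t in range(4)])
```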
Today is the start of a new era of natively multimodal AI innovation. Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality. Llama 4 Scout • 17B-active-parameter model…
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
🚨 New #CVPR2025 Paper 🚨 🏀BASKET: A Large-Scale Dataset for Fine-Grained Basketball Skill Estimation🎥 4,477 hours of videos⏱️ | 32,232 players⛹️ | 20 fine-grained skills🎯 We present a new video dataset for skill estimation with unprecedented scale and diversity! A thread👇
For those of you who know me, I've always been very excited to combine my two passions for basketball and CV. Our #CVPR2025 paper does this by introducing a large-scale video dataset for fine-grained skill estimation in 🏀. Paper, code & data available: sites.google.com/cs.unc.edu/bas…