Orr Zohar
@orr_zohar
@nvidia • @Stanford • @KnightHennessy scholar • Researching large multimodal models
🧵 Introducing TimeScope, an open-source benchmark rigorously evaluating the true “temporal context window” of video-language models on videos ranging from 1 minute to 8 hours. #AI #MachineLearning
🧠 How can we truly test long-context video understanding in video-LMMs? ⏱️ TimeScope benchmarks models from 1 min to 8 hours using “needle-in-a-haystack” probes. 🚀 Gemini 2.5-Pro leads the pack—but even it struggles as context length grows. Long-range memory is still a…
Thrilled to announce our MiMo-VL series hit 100K downloads on HuggingFace last month! 🚀🚀 Incredible to see the community's enthusiasm for our VLMs. More exciting updates coming soon! 😜 huggingface.co/XiaomiMiMo/MiM…
timescope: testing whether large models actually understand long videos or just claim to 🤠 they randomly insert needles (short videos/static images) into long videos and ask questions about the needle itself 🤯 Gemini seems to be the best! very cool work by @orr_zohar et al 👏
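For readers curious about the mechanics: a minimal sketch of the needle-in-a-haystack idea described above. The frame counts, needle content, and question template are illustrative assumptions, not TimeScope's actual data pipeline.

```python
# Illustrative sketch of needle-in-a-haystack video probing: splice a short
# "needle" clip into a long "haystack" video at a random offset, then ask a
# question that can only be answered from the needle.
import random
import numpy as np

def insert_needle(haystack_frames: list, needle_frames: list, seed: int = 0):
    """Return the spliced frame sequence and the needle's start index."""
    rng = random.Random(seed)
    start = rng.randint(0, len(haystack_frames))
    spliced = haystack_frames[:start] + needle_frames + haystack_frames[start:]
    return spliced, start

# Toy data: a 1-hour haystack at 1 fps and a 10-second needle.
haystack = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(3600)]
needle = [np.full((224, 224, 3), 255, dtype=np.uint8) for _ in range(10)]

video, needle_start = insert_needle(haystack, needle)
probe = {
    "question": "A short clip of a white screen appears somewhere in the video. "
                "Roughly when does it appear?",
    "answer_frame_range": (needle_start, needle_start + len(needle)),
}
print(len(video), probe["answer_frame_range"])
```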
SmolVLM has been accepted to @COLM_conf 2025 🥳! See you in Montreal!
Introducing the smollest VLMs yet! 🤏 SmolVLM (256M & 500M) runs on <1GB GPU memory. Fine-tune it on your laptop and run it on your toaster. 🚀 Even the 256M model outperforms our Idefics 80B (Aug '23). How small can we go? 👀
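A quick inference sketch for trying the small checkpoints via transformers. The Hub id and prompt format below follow the usual vision-to-seq chat flow and are assumptions; check the model card for the exact usage.

```python
# Minimal SmolVLM-style inference sketch with transformers.
# "HuggingFaceTB/SmolVLM-256M-Instruct" is an assumed Hub id.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed model id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```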
Today, we are open-sourcing our pipeline to deduplicate large-scale image datasets. On one GPU, we can deduplicate 10k images against 1M indexed test images in ~60 seconds. But how?
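The tweet doesn't spell out the pipeline, but the usual recipe is: embed every image, index the reference set, and flag near-duplicates by cosine similarity. A sketch of that recipe with random vectors standing in for real encoder features; the FAISS index type and the 0.95 threshold are illustrative choices, not necessarily what the released pipeline uses.

```python
# Embedding-based image deduplication sketch: index reference embeddings,
# then screen query embeddings for near-duplicates by cosine similarity.
import numpy as np
import faiss

dim = 512
rng = np.random.default_rng(0)

# Reference embeddings (stand-ins for CLIP/SigLIP-style features),
# L2-normalized so inner product equals cosine similarity.
ref = rng.standard_normal((100_000, dim)).astype("float32")
ref /= np.linalg.norm(ref, axis=1, keepdims=True)
index = faiss.IndexFlatIP(dim)  # move to GPU with faiss.index_cpu_to_all_gpus(index)
index.add(ref)

# Query embeddings to screen against the indexed set.
query = rng.standard_normal((10_000, dim)).astype("float32")
query /= np.linalg.norm(query, axis=1, keepdims=True)

scores, ids = index.search(query, k=1)        # nearest reference image per query
duplicates = np.where(scores[:, 0] > 0.95)[0]  # assumed similarity threshold
print(f"{len(duplicates)} likely duplicates out of {len(query)}")
```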
Robotics models are increasingly bulky and difficult to run directly on robots. With @RemiCadene and the team @LeRobotHF and @huggingface we’re changing that. Introducing SmolVLA, a sub-500M VLA designed for efficient training and inference. A thread 🧵
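To make the VLA shape concrete, a toy PyTorch module showing the general idea (fuse image and instruction features, predict a short chunk of actions). This is a generic sketch, not SmolVLA's architecture or the LeRobot API.

```python
# Toy vision-language-action (VLA) illustration: encode an image and an
# instruction, fuse them, and regress a chunk of future robot actions.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=256, action_dim=7, chunk=8):
        super().__init__()
        self.vision = nn.Sequential(              # stand-in for a pretrained vision encoder
            nn.Conv2d(3, 32, 8, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, vis_dim))
        self.text = nn.EmbeddingBag(1000, txt_dim)  # stand-in for a language backbone
        self.head = nn.Sequential(                  # action head: predicts a chunk of actions
            nn.Linear(vis_dim + txt_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim * chunk))
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, image, token_ids):
        fused = torch.cat([self.vision(image), self.text(token_ids)], dim=-1)
        return self.head(fused).view(-1, self.chunk, self.action_dim)

policy = TinyVLA()
actions = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(actions.shape)  # (1, 8, 7): 8 future steps of a 7-DoF action
```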
WE ARE COOKING!! I’m looking for a creative engineer to join the ride 🤩 If that’s you, send me a message 🚀 You should be someone who learns tools fast, builds scrappy hacks when needed, and focuses on what works. You might be working in the space of media, image/video…
New open-source drop from the HF team - nanoVLM A super tight codebase to learn/train VLMs with good performance - inspired by @karpathy 's NanoGPT 750 lines of pytorch code. Training a 222M-parameter nanoVLM for 6 hours on a single H100 reaches 35.3% on MMStar, matching the…
Today, we are open-sourcing nanoVLM, a pure pytorch library to train a Vision-Language Model from scratch in 750 lines of code. Training on one H100 for 6h, we get 35.3% on MMStar, matching SmolVLM-256M which was trained with 100x more GPU hours. 👀 Even in a FREE Google Colab,…
Alert alert, we got our first external contribution to the nanoVLM project! Thank you, @not_so_lain !
BOOOM! Learn VLMs from inside out in < 1000 lines of pure PyTorch code! 🔥 github.com/huggingface/na…
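For context on what those ~750 lines wire together: a skeleton of the vision encoder → modality projector → language model pattern that nanoVLM implements. Shapes and module choices here are illustrative stand-ins; the real code lives at github.com/huggingface/nanoVLM.

```python
# Skeleton of a nanoVLM-style model: patch embeddings are projected into the
# LM's embedding space and concatenated with text tokens before decoding.
import torch
import torch.nn as nn

class MiniVLM(nn.Module):
    def __init__(self, vocab=32000, d_model=384):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)   # stand-in ViT patchifier
        self.projector = nn.Linear(d_model, d_model)          # vision -> LM space
        self.tok_embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.TransformerEncoder(                 # transformer stand-in for the LM
            nn.TransformerEncoderLayer(d_model, nhead=6, batch_first=True),
            num_layers=4)                                     # (no causal mask, for brevity)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, patches, token_ids):
        img_tok = self.projector(self.patch_embed(patches))   # (B, n_patches, d)
        txt_tok = self.tok_embed(token_ids)                    # (B, seq_len, d)
        seq = torch.cat([img_tok, txt_tok], dim=1)             # image tokens first
        return self.lm_head(self.decoder(seq))                 # next-token logits

model = MiniVLM()
patches = torch.randn(2, 64, 16 * 16 * 3)     # 64 flattened 16x16 RGB patches
tokens = torch.randint(0, 32000, (2, 10))
print(model(patches, tokens).shape)           # (2, 74, 32000)
```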
Excited to present Video-STaR at #ICLR2025’s poster session tomorrow! 🗓️ Visit me at Poster 91, 10:00 AM–12:30 PM 🚀 Dive into our work on advancing video reasoning using self-training:
🚀 Can self-training improve general LVLM performance? 🏎️ How can you adapt your LVLMs to new and diverse applications? 📢 Happy to announce Video-STaR, a self-training approach to utilize any supervision for video instruction tuning! 🧵👇
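A schematic of the self-training (STaR-style) loop behind this: generate candidate answers, keep only those consistent with existing supervision, and fine-tune on the verified set. The stub functions below are hypothetical placeholders, not the Video-STaR implementation.

```python
# Generate -> verify -> fine-tune loop for self-training with weak labels.
from dataclasses import dataclass

@dataclass
class Example:
    video_id: str
    question: str
    label: str  # any existing supervision (e.g. action class, caption)

def generate_answer(model_state: dict, ex: Example) -> str:
    # Placeholder for LVLM inference over the video + question.
    return model_state.get(ex.video_id, "unknown")

def label_verified(answer: str, ex: Example) -> bool:
    # Placeholder verifier: keep the answer only if it agrees with the label.
    return ex.label.lower() in answer.lower()

def fine_tune(model_state: dict, verified: list) -> dict:
    # Placeholder for an instruction-tuning step on the verified generations.
    return {**model_state, **{ex.video_id: ans for ex, ans in verified}}

dataset = [Example("vid_0", "What is the person doing?", "cooking")]
model_state = {"vid_0": "The person is cooking pasta."}

for round_idx in range(3):  # a few self-training rounds
    candidates = [(ex, generate_answer(model_state, ex)) for ex in dataset]
    verified = [(ex, ans) for ex, ans in candidates if label_verified(ans, ex)]
    model_state = fine_tune(model_state, verified)
    print(f"round {round_idx}: kept {len(verified)}/{len(dataset)} generations")
```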