Hritik Bansal
@hbXNov
CS PhD @UCLA | Intern @MetaAI FAIR | Prev: Bachelor's @IITDelhi, Intern @GoogleDeepMind @AmazonScience | Multimodal ML, Language models | Cricket🏏
📢Scaling test-time compute via generative verification (GenRM) is an emerging paradigm that has been shown to be more efficient than self-consistency (SC) for reasoning. But such claims are misleading☠️ Our compute-matched analysis shows that SC outperforms GenRM across most budgets! 🧵
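For intuition, here is a minimal compute-matched sketch of the two strategies: self-consistency spends the whole generation budget on candidate solutions and majority-votes their answers, while GenRM splits the same budget between solutions and verifier calls. `sample_solution` and `score_with_verifier` are hypothetical stand-ins for LLM calls, not the paper's actual code.

```python
# Minimal compute-matched sketch: both strategies get the same number of model
# generations ("budget"). `sample_solution` and `score_with_verifier` are
# hypothetical callables standing in for LLM calls.
from collections import Counter
from typing import Callable

def self_consistency(sample_solution: Callable[[], str], budget: int) -> str:
    """Spend the whole budget on solutions, then majority-vote the answers."""
    answers = [sample_solution() for _ in range(budget)]
    return Counter(answers).most_common(1)[0][0]

def genrm(sample_solution: Callable[[], str],
          score_with_verifier: Callable[[str], float],
          budget: int, verifications_per_solution: int = 3) -> str:
    """Split the budget between candidate solutions and verifier calls."""
    n_solutions = max(1, budget // (1 + verifications_per_solution))
    candidates = [sample_solution() for _ in range(n_solutions)]
    def avg_score(sol: str) -> float:
        # Average several generative-verifier judgments per candidate.
        scores = [score_with_verifier(sol) for _ in range(verifications_per_solution)]
        return sum(scores) / len(scores)
    # At a fixed budget, GenRM votes over far fewer candidates than SC does.
    return max(candidates, key=avg_score)
```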

LaViDa: A Large Diffusion Language Model for Multimodal Understanding "We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following." "LaViDa achieves…
🙌 We've released the full version of our paper, OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles Our OpenVLThinker-v1.2 is trained through three lightweight SFT → RL cycles, where SFT first “highlights” reasoning behaviors and RL then explores and…
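A rough sketch of the iterative SFT → RL loop as I read it from the thread; every helper here (`sft_finetune`, `rl_finetune`, `collect_traces`, `reward_fn`) is a hypothetical placeholder supplied by the caller, not OpenVLThinker's actual training code.

```python
# Hedged sketch of an iterative SFT -> RL training loop. All helpers are
# hypothetical placeholders passed in by the caller, not OpenVLThinker's API.
def iterative_sft_rl(model, seed_data, prompts,
                     sft_finetune, rl_finetune, collect_traces, reward_fn,
                     n_cycles: int = 3):
    data = seed_data
    for _ in range(n_cycles):
        # SFT "highlights" the target reasoning behaviors present in the data.
        model = sft_finetune(model, data)
        # RL then explores from that initialization with a verifiable reward.
        model = rl_finetune(model, prompts, reward_fn)
        # Keep only high-reward traces to seed the next cycle's SFT data.
        data = [t for t in collect_traces(model, prompts) if reward_fn(t) > 0]
    return model
```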
📢📢📢 Releasing OpenThinker3-1.5B, the top-performing SFT-only model at the 1B scale! 🚀 OpenThinker3-1.5B is a smaller version of our previous 7B model, trained on the same OpenThoughts3-1.2M dataset.
Excited to share that I will join @Meta FAIR (Seattle 🗻) for my final summer internship w/ @ramakanth1729! 🧑‍🎓 Looking forward to meeting new people, learning new things, and chatting about data, algorithms, and evaluation for LLM/VLM reasoning.
🥳 Excited to share that VideoPhy-2 has been awarded 🏆 Best Paper at the World Models Workshop (physical-world-modeling.github.io) #ICML2025! Looking forward to presenting it as a contributed talk at the workshop! 😃 w/ @clarkipeng @YonatanBitton Roman @adityagrover_ @kaiwei_chang…
Video generative models hold the promise of being general-purpose simulators of the physical world 🤖 How far are we from this goal❓ 📢Excited to announce VideoPhy-2, the next edition in the series to test the physical likeness of the generated videos for real-world actions. 🧵
🚨 New work: LLMs still struggle at Event Detection due to poor long-context reasoning and inability to follow task constraints, causing precision and recall errors. We introduce DiCoRe — a lightweight 3-stage Divergent-Convergent reasoning framework to fix this.🧵📷 (1/N)
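The tweet doesn't spell out the three stages, so below is only a hedged sketch of what a divergent-convergent event-detection pipeline could look like; `llm` is a hypothetical text-in/text-out callable and the stage prompts are illustrative, not DiCoRe's.

```python
from typing import Callable, List

def divergent_convergent_ed(llm: Callable[[str], str], document: str,
                            ontology: List[str]) -> List[str]:
    # Stage 1 (divergent): unconstrained brainstorming of candidate events,
    # ignoring the label ontology to maximize recall.
    draft = llm(f"List every event mentioned in this text:\n{document}")
    candidates = [ln.strip("-* ").strip() for ln in draft.splitlines() if ln.strip()]
    # Stage 2 (convergent): ground each candidate onto the allowed event types,
    # recovering precision and enforcing the task constraints.
    grounded = []
    for cand in candidates:
        label = llm(f"Map '{cand}' to one of {ontology}, or answer NONE.").strip()
        if label in ontology:
            grounded.append(label)
    # Stage 3 (verify): a final check that each grounded event is truly present.
    verdict = llm(f"Text:\n{document}\nWhich of these events actually occur? {grounded}")
    return [event for event in grounded if event in verdict]
```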
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
🧑🍳Very excited to present LaViDa, one of the first diffusion language models for multimodal understanding! 🌟Unlike autoregressive LMs, you can control the speed-quality tradeoff, and solve constrained generation problems out of the box 📦 🌟 We also release LaViDa-Reason, a…
📢(1/11) Diffusion LMs are fast and controllable at inference time! But why restrict these benefits to text data? We are excited to announce LaViDa, one of the first and fastest large diffusion LMs for vision-language understanding!!
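Why diffusion LMs expose a speed-quality knob: decoding runs for a fixed number of parallel denoising steps, so fewer steps commit more tokens at once (faster, usually worse) and more steps commit fewer (slower, usually better). Below is a hedged MaskGIT-style sketch of this idea, not LaViDa's actual decoder; `model_logits` is a hypothetical callable.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id

def diffusion_decode(model_logits, seq_len: int, steps: int):
    """Fill a fully masked sequence in `steps` parallel denoising iterations."""
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = model_logits(tokens)  # hypothetical: (seq_len, vocab) array
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        masked = np.flatnonzero(tokens == MASK)
        # Commit the most confident share of still-masked positions so that
        # everything is unmasked after `steps` iterations: the fewer the steps,
        # the more tokens get committed in parallel per step.
        k = int(np.ceil(len(masked) / (steps - step)))
        keep = masked[np.argsort(-conf[masked])[:k]]
        tokens[keep] = pred[keep]
    return tokens
```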
Great to see that the latest #GeminiDiffusion release benchmarks on our challenging general-purpose reasoning dataset, BIG-Bench Extra Hard (BBEH)! It is now available on HF 🤗: huggingface.co/datasets/BBEH/… Eval code: github.com/google-deepmin…
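If you want to poke at the benchmark yourself, here is a minimal `datasets` snippet; the repo id below is a placeholder (the URL in the tweet is truncated), so substitute the real path from the HF link before running.

```python
from datasets import load_dataset

# Placeholder repo id: the URL above is truncated, so look up the real path.
bbeh = load_dataset("BBEH/<dataset-name>")
print(bbeh)  # shows the available splits and columns
```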
Is BIG-Bench Hard too easy for your LLM? We just unleashed BIG-Bench EXTRA Hard (BBEH)! 😈 Every task, harder! Every model, humbled! (Poem Credit: Gemini 2.0 Flash) Massive headroom for progress across various areas in general reasoning 🤯
📢 Submit your cool ideas as short or long papers to the first workshop on the foundations of long video generation, understanding, and evaluation 🚀 ramoscsv.github.io/longvid_founda…
📢 Announcing our 1st Workshop on Long Multi-Scene Video Foundations @ #ICCV2025 (@ICCVConference) in Honolulu, Hawaii! Co-organized by @regev_cohen, @SivanDoveh, @hila_chefer, Jehanzeb Mirza, @hbXNov, @inbar_mosseri, Joao Magalhaes and me. website: ramoscsv.github.io/longvid_founda…