Rohit Saxena
@rohit_saxena
PhD Student @Edin_CDT_NLP @EdinburghNLP
📣This work will appear at the ICLR 2025 Workshop on Reasoning and Planning for LLMs.🇸🇬 I'm currently on the job market, looking for research scientist roles. Feel free to reach out if you're hiring or know of any opportunities!
LLMs can tackle math olympiad problems but... can they read a clock 🤔? 🕰️📆 Our experiments reveal surprising failures in temporal reasoning: MLLMs struggle with analogue clock reading & date inference! Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
We couldn’t be there in person, but our poster will be at #NAACL2025! Feel free to ping @aryopg with any questions or follow-ups.
MMLU-Redux just touched down at #NAACL2025! 🎉 Wish I could be there for our "Are We Done with MMLU?" poster today (9:00-10:30am in Hall 3, Poster Session 7), but visa drama said nope 😅 If anyone's swinging by, give our research some love! Hit me up if you check it out! 👋
'Theorem Prover as a Judge for Synthetic Data Generation' has been accepted to ACL (Main) 🚀. Catch us on July 30th (Wednesday), 11:00-12:30pm, in Hall 4/5! A huge thank you to my amazing collaborators: Shay @GiwonHong413849 @WendaLi8 📝: aclanthology.org/2025.acl-long.…
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
Why do AI assistants feel so generic? Our new #ACL2025 paper, PersonaLens🔎, tackles this head-on. We built a new benchmark to test personalization in ways that matter. I'll be presenting our work at the poster session in Vienna next week! 🧵[1/4]
🚨New paper alert!🚨 "Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them" @ActInterp ICML'25 @deepseek_ai popularised RLVR and distillation for 'reasoning training'! But how do they differ under the hood? Details in 🧵: (1/8)
Figures are taken from the paper, which was published at the ICLR 2025 Workshop on Reasoning and Planning for LLMs: openreview.net/forum?id=5gfC2… Nice work @rohit_saxena @aryopg and @PMinervini.
🔁 What if you could bootstrap a world model (state1 × action → state2) using a much easier-to-train dynamics model (state1 × state2 → action) in a generalist VLM? 💡 We show how a dynamics model can generate synthetic trajectories & serve for inference-time verification 🧵👇
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers for the pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
MMLongBench Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Check out our MMLongBench - a long-context benchmark for vision and language models. 🚀📏 Work led by amazing @ZhaoweiWang4
🚨 New paper! 🚨 Many recent LVLMs claim massive context windows, but can they handle long contexts on diverse downstream tasks? 🤔 💡In our new paper, we find that most models still fall short! We introduce MMLongBench, the first comprehensive benchmark for long-context VLMs:…
We propose Neurosymbolic Diffusion Models! We find diffusion is especially compelling for neurosymbolic approaches, combining powerful multimodal understanding with symbolic reasoning 🚀 Read more 👇
🚀Check out VISTA - a large-scale benchmark for scientific video summarization! #ACL2025 By amazing @dongqi_me
🚨 Long Paper Accepted at @aclmeeting 2025 main conference! 🚨 🎥 Our work "What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations" introduces VISTA, a large-scale benchmark for scientific video summarization. #ACL2025 #NLProc #LLMs 🧵(1/3)
Preprint: Can we learn to reason for story generation (~100k tokens), without reward models? Yes! We introduce an RLVR-inspired reward paradigm VR-CLI that correlates with human judgements of quality on the 'novel' task of Next-Chapter Prediction. Paper: arxiv.org/abs/2503.22828
📢Scaling test-time compute via generative verification (GenRM) is an emerging paradigm that has been shown to be more efficient than self-consistency (SC) for reasoning. But such claims are misleading☠️ Our compute-matched analysis shows that SC outperforms GenRM across most budgets! 🧵