Zhaowei Wang
@ZhaoweiWang4
Visiting @EdinburghNLP with Mark Steedman | PhD student @hkustNLP and @HKUSTKnowComp with @yqsong | Previous Intern @NVIDIAAI and @TencentGlobal
Check out our MMLongBench, with comprehensive coverage of long-context vision-language applications!
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
'Theorem Prover as a Judge for Synthetic Data Generation' has been accepted to ACL (Main) 🚀. Do check us out on July 30th (Wednesday), 11:00am-12:30pm, at Hall 4/5! A huge thank you to my amazing collaborators: Shay @GiwonHong413849 @WendaLi8 📝: aclanthology.org/2025.acl-long.…
Transformers struggle with length generalization and long context. What can we do about it? Our new #TMLR paper with @rolandalong, @paul_smolensky and @JianfengGao0217 shows how to handle the issue using a new attention mechanism called TRA. Curious? Read the 🧵 for more 🤓
A very insightful piece of work on RL and SFT from Neel
🚨New paper alert!🚨 "Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them" @ActInterp ICML'25 @deepseek_ai popularised RLVR and distillation for 'reasoning training'! But how do they differ under the hood? Details in 🧵: (1/8)
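For context, a minimal sketch of the group-relative advantage computation at the heart of GRPO (my own illustration, not the paper's code): each prompt gets a group of sampled completions, and each completion's reward is normalized against its group's mean and std.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean/std of its group (one prompt, G samples).

    rewards: shape (num_prompts, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary verifier rewards (RLVR)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```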
🦉 Automate complex tasks using Gemini 2.5 Pro and @CamelAIOrg’s OWL (Optimized Workforce Learning), an open-source multi-agent collaboration framework that works together like a real-world project team.
📢 New paper alert 📢 We introduce MobileGUI-RL, an RL framework advancing mobile GUI agents through trajectory-based rollouts and rewards in 𝗼𝗻𝗹𝗶𝗻𝗲 environments. With RL, Qwen 2.5-VL achieves a 44.8% success rate on AndroidWorld! ✨ Check out the paper at: arxiv.org/abs/2507.05720
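A hypothetical sketch of what trajectory-based rollouts with a terminal task reward look like (the env/policy interfaces here are stand-ins, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    observations: list = field(default_factory=list)  # screenshots / UI trees
    actions: list = field(default_factory=list)       # taps, swipes, typing
    reward: float = 0.0                               # trajectory-level signal

def rollout(env, policy, max_steps: int = 30) -> Trajectory:
    """Roll the agent in a live (online) GUI environment, then attach a
    single trajectory-level reward, e.g. task success from a verifier."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)      # VLM proposes the next GUI action
        traj.observations.append(obs)
        traj.actions.append(action)
        obs, done = env.step(action)
        if done:
            break
    traj.reward = env.task_success()  # 1.0 if the task goal is met
    return traj

class DummyEnv:
    """Toy stand-in for an online Android environment (hypothetical)."""
    def __init__(self): self.t = 0
    def reset(self): self.t = 0; return "screen_0"
    def step(self, action): self.t += 1; return f"screen_{self.t}", self.t >= 3
    def task_success(self): return 1.0

class DummyPolicy:
    def act(self, obs): return "tap(home)"

print(rollout(DummyEnv(), DummyPolicy()).reward)
```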
Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and…
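One cheap way to see this concentration effect for yourself: track next-token entropy step by step during greedy decoding and watch the distribution sharpen as generation progresses. A minimal sketch with gpt2 as the illustration model (the paper's models and exact metric may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
entropies = []
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits[0, -1]  # next-token logits
        probs = torch.softmax(logits, dim=-1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
        ids = torch.cat([ids, probs.argmax().view(1, 1)], dim=1)  # greedy step

print([round(h, 2) for h in entropies])  # entropy often trends downward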
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's…
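To make the idea concrete, here is a minimal PyTorch sketch of one common sparse attention pattern (a causal local window plus a few global "sink" tokens); this is my own illustration of the kind of pattern such abstractions implement, not sparse-frontier's actual API:

```python
import torch

def local_plus_sink_mask(seq_len: int, window: int = 4, sinks: int = 2) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: each query attends to a local causal
    window plus a few global 'sink' tokens at the start."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    causal = k <= q          # no attending to future tokens
    local = (q - k) < window # recent-window attention
    sink = k < sinks         # always-visible prefix tokens
    return causal & (local | sink)

print(local_plus_sink_mask(8).int())
```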
😵💫 Long-context human-AI planning with LLMs struggles when users have to manually manage all the context in messy chats (e.g. with ChatGPT). Meet 💡JumpStarter: task-structured context curation for better, collaborative planning with LLMs on complex tasks. 🧵 (1/n)
🚀 We release MMLongBench: a benchmark for evaluating long-context VLMs. 📊 13,331 examples across 5 tasks: – Visual RAG – Many-shot ICL – Needle-in-a-haystack – VL Summarization – Long-document VQA 📏 Lengths: 8 / 16 / 32 / 64 / 128K 🔍 Benchmarking both thoroughly & effectively!
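For readers unfamiliar with the needle-in-a-haystack task: you plant one relevant fact at a controlled depth in a long filler context and check whether the model retrieves it. A generic text-only sketch (MMLongBench's vision-language version interleaves images, and its data format is not shown here):

```python
def make_niah_prompt(needle: str, filler: str, n_fill: int, depth: float) -> str:
    """Build a long context with `needle` inserted at a relative depth
    (0.0 = start, 1.0 = end), the standard needle-in-a-haystack setup."""
    chunks = [filler] * n_fill
    chunks.insert(int(depth * n_fill), needle)
    return "\n".join(chunks) + "\nQuestion: what is the secret number?"

prompt = make_niah_prompt(
    needle="The secret number is 7481.",
    filler="Grass is green and the sky is blue.",
    n_fill=1000,
    depth=0.5,
)
print(len(prompt.split()), "words")
```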
We propose Neurosymbolic Diffusion Models! We find diffusion is especially compelling for neurosymbolic approaches, combining powerful multimodal understanding with symbolic reasoning 🚀 Read more 👇
🤔Looking for research ideas in AI safety? Check out our newest survey including nearly 1000 papers🚀 on full-stack AI safety, covering Data, Pre-training, Post-training, Model Editing, and Deployment (Agent). arXiv: arxiv.org/abs/2504.15585
🚀 Thrilled to share our new work on Theorem Proving! Our MPS-Prover sets a new SOTA for step-provers on miniF2F, solving 185/244 problems (75.82%)! Plus, it generates significantly SHORTER proofs than whole-proof provers. Our avg proof length: 3.5! Details: arxiv.org/pdf/2505.10962
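To illustrate what a compact tactic proof looks like, a toy Lean 4 / Mathlib goal in the miniF2F style (my own example, not an actual MPS-Prover output):

```lean
-- Requires Mathlib. A short tactic proof of an AM-GM-flavored inequality,
-- the kind of compact proof step-provers search for.
import Mathlib

theorem toy (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]
```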
Can multimodal LLMs truly understand research poster images?📊 🚀 We introduce PosterSum—a new multimodal benchmark for scientific poster summarization! 🪧 📂 Dataset: huggingface.co/datasets/rohit… 📜 Paper: arxiv.org/abs/2502.17540