Weiming Ren
@wmren993
CS PhD student @UWaterloo @UWCheritonCS
📢 Introducing VisCoder – fine-tuned language models for Python-based visualization code generation and feedback-driven self-debugging. Existing LLMs struggle to generate reliable plotting code: outputs often raise exceptions, produce blank visuals, or fail to reflect the…
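For intuition, here is a minimal sketch of what a feedback-driven self-debugging loop for plotting code can look like: run the generated script, capture any traceback, and hand it back to the model for a repair. The helper names and the retry budget are illustrative placeholders, not the VisCoder interface.

```python
import traceback

MAX_ROUNDS = 3  # illustrative retry budget, not a VisCoder hyperparameter

def run_with_feedback(generate_code, task: str) -> str:
    """generate_code(prompt) -> Python plotting script; stand-in for the model call."""
    prompt = task
    code = generate_code(prompt)
    for _ in range(MAX_ROUNDS):
        try:
            exec(code, {})          # run the generated plotting script
            return code             # no exception raised: keep this version
        except Exception:
            error = traceback.format_exc()
            # feed the traceback back to the model and ask for a repaired script
            prompt = f"{task}\n\nPrevious code:\n{code}\n\nError:\n{error}\nPlease fix the code."
            code = generate_code(prompt)
    return code                      # last attempt, even if still failing
```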
Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and…
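As a rough illustration of the kind of rollout such a framework has to support, the sketch below interleaves model generation with tool execution until a final answer is produced; the resulting trajectory would then be scored for GRPO/PPO. The tag format and helper names are assumptions, not the VerlTool or verl API.

```python
import re

def parse_tool_call(segment: str):
    """Return (tool_name, argument) if the segment ends with a tool call, else None."""
    m = re.search(r'<tool name="(\w+)">(.*?)</tool>\s*$', segment, re.S)
    return (m.group(1), m.group(2)) if m else None

def rollout(generate, tools, question: str, max_turns: int = 4) -> str:
    trajectory = question
    for _ in range(max_turns):
        segment = generate(trajectory)                  # model continues the trajectory
        trajectory += segment
        call = parse_tool_call(segment)
        if call is None:                                # final answer, stop the episode
            break
        name, arg = call
        result = tools[name](arg)                       # execute the tool (search, code, ...)
        trajectory += f"\n<result>{result}</result>\n"
    return trajectory                                    # later scored by a reward function
```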
🚀 New Paper: Pixel Reasoner 🧠🖼️
How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement…
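A minimal sketch of the “think in pixel space” idea: mid-reasoning, the model can emit a visual operation such as a zoom-in, receive the resulting crop back in its context, and continue. The operation set and action format here are illustrative assumptions, not the Pixel Reasoner codebase.

```python
from PIL import Image

def zoom_in(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop the region of interest and upscale it so small details become legible."""
    patch = image.crop(box)
    return patch.resize((patch.width * 2, patch.height * 2))

def reasoning_step(decide, context: list, image: Image.Image) -> list:
    """One step of interleaved reasoning: `decide` (a stand-in for the VLM) returns
    either plain text or a visual operation on the image."""
    action = decide(context)
    if action["type"] == "zoom_in":
        patch = zoom_in(image, action["box"])
        context.append(patch)                # the zoomed view re-enters the context
    else:
        context.append(action["text"])       # ordinary chain-of-thought text
    return context
```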
🧠📽️ New benchmark release: VideoEval-Pro!
Long Video Understanding (LVU) is critical for building truly intelligent multimodal systems — think surveillance analysis, instructional video QA, or summarizing hour-long meetings.
But here's the problem👇
🧩 Nearly all existing LVU…
Excited to share VideoEval-Pro, a robust and comprehensive evaluation suite for long video understanding (LVU) models.
📊 1,289 open-ended questions from 465 long videos (avg. 38 mins)
🎯 Diverse task types: perception and reasoning tasks based on local and holistic video content
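As a rough sketch of how open-ended LVU answers can be scored, the snippet below has the model produce a free-form answer (no options to guess from) and a judge compare it against the reference. The field names and judge prompt are assumptions, not the VideoEval-Pro evaluation code.

```python
def evaluate(model, judge, examples) -> float:
    """examples: dicts with 'video', 'question', and a reference 'answer' (assumed schema)."""
    correct = 0
    for ex in examples:
        prediction = model.answer(ex["video"], ex["question"])   # free-form answer
        verdict = judge(
            f"Question: {ex['question']}\n"
            f"Reference answer: {ex['answer']}\n"
            f"Model answer: {prediction}\n"
            "Is the model answer correct? Reply yes or no."
        )
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(examples)
```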
🎬 Automated filmmaking is the future: you need dialogue, expressive talking heads, synchronized body motion, and multi-character interactions.
🚀 Today, in collaboration with @AIatMeta, we’re excited to introduce MoCha: Towards Movie-Grade Talking Character Synthesis 🔊…
🚀Thrilled to introduce ☕️MoCha: Towards Movie-Grade Talking Character Synthesis
Please unmute to hear the demo audio.
✨We defined a novel task: Talking Characters, which aims to generate character animations directly from Natural Language and Speech input.
✨We propose…
Excited to share what I've been working on lately: ABC - a multimodal embedding model trained to embed specific aspects of an image. ABC is perfect for visual embedding tasks that need a little more control over what gets embedded. Details on the training pipeline 👇
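A small sketch of what instruction-conditioned embedding enables: the same image yields different embeddings depending on which aspect the instruction asks about, so retrieval can target that aspect. The encoder functions below are stand-ins, not the released ABC interface.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(embed_image, embed_text, image, instruction: str, candidate_captions: list[str]) -> str:
    """embed_image(image, instruction) / embed_text(text) are assumed encoder calls."""
    query = embed_image(image, instruction)            # e.g. "focus on the object on the left"
    scores = [cosine(query, embed_text(c)) for c in candidate_captions]
    return candidate_captions[int(np.argmax(scores))]  # caption best matching that aspect
```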
🚨 New Paper Alert! 🚨
Thrilled to announce VAMBA: a powerful hybrid Mamba-Transformer architecture designed specifically for hour-long video understanding tasks! VAMBA can efficiently process more than 1000 frames on a single GPU!
🎯 Why do we need hour-long video models?…
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
VAMBA is a hybrid Mamba-Transformer model for long video understanding that uses Mamba-2 blocks to encode video tokens with linear complexity. It handles over 1024 frames without token reduction, reducing GPU…
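A rough sketch of a hybrid block in this spirit: the long video token sequence goes through a linear-complexity Mamba-2 path, while the short text sequence uses self-attention plus cross-attention into the video tokens. The exact wiring here is an assumption for illustration, not the released Vamba architecture or config.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim: int, heads: int, mamba_block: nn.Module):
        super().__init__()
        self.mamba = mamba_block                               # e.g. a Mamba-2 layer, linear in video length
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor):
        # video path: no quadratic attention over the (very long) frame sequence
        video_tokens = video_tokens + self.mamba(video_tokens)
        # text path: short sequence, so full self-attention stays cheap
        t, _ = self.self_attn(text_tokens, text_tokens, text_tokens)
        text_tokens = text_tokens + t
        # text queries attend to video keys/values: cost grows with len(text) * len(video)
        t, _ = self.cross_attn(text_tokens, video_tokens, video_tokens)
        return text_tokens + t, video_tokens
```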