Yuansheng Ni
@YuanshengNi
Keep Learning!
📢 Introducing VisCoder – fine-tuned language models for Python-based visualization code generation and feedback-driven self-debugging. Existing LLMs struggle to generate reliable plotting code: outputs often raise exceptions, produce blank visuals, or fail to reflect the…
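A minimal sketch of the feedback-driven self-debugging loop described above: run the generated plotting code, and if it fails, feed the traceback back for a revised attempt. `generate_plot_code` is a hypothetical stand-in for a model call, not VisCoder's actual API.

```python
import subprocess
import sys
import tempfile
import textwrap

def generate_plot_code(prompt: str, feedback: str | None = None) -> str:
    # Hypothetical stand-in for a model call (e.g., a VisCoder checkpoint).
    # A real implementation would condition on the prompt and error feedback.
    return textwrap.dedent("""
        import matplotlib
        matplotlib.use("Agg")  # headless backend, avoids blank-window issues
        import matplotlib.pyplot as plt
        plt.plot([1, 2, 3], [1, 4, 9])
        plt.savefig("out.png")
    """)

def self_debug(prompt: str, max_rounds: int = 3) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        code = generate_plot_code(prompt, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            # Run in a subprocess so crashes don't take down the harness.
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=60)
        except subprocess.TimeoutExpired:
            feedback = "Execution timed out."
            continue
        if result.returncode == 0:
            return code  # executable plotting code
        feedback = result.stderr[-2000:]  # traceback goes back to the model
    return None
```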
🚀 This year, we’ve rolled out a series of updates to EasyEdit1, and dropped EasyEdit2 to steer LLM behavior on the fly! 🔧✨
👉 Code: github.com/zjunlp/EasyEdit
What’s new?
• Datasets: Integrated AKEW, LEME & UNKE
• Methods: NAMET, CORE, UNKE, AnyEdit & Reference-free Preference…
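For context, a typical editing call in EasyEdit follows the pattern below, adapted from the repo's README; treat the exact class and argument names as assumptions, since they may have shifted across the updates listed above.

```python
from easyeditor import BaseEditor, ROMEHyperParams

# Hyperparameters for one editing method (ROME); the YAML path follows the
# repo's hparams/ layout and depends on your local checkout.
hparams = ROMEHyperParams.from_hparams("./hparams/ROME/gpt2-xl.yaml")
editor = BaseEditor.from_hparams(hparams)

# Rewrite a single fact in the model's weights and get pre/post-edit metrics.
metrics, edited_model, _ = editor.edit(
    prompts=["The capital of France is"],
    ground_truth=["Paris"],
    target_new=["Lyon"],
    subject=["France"],
)
print(metrics)
```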
🔎 Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis ⚠️
Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge
- 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor
- …
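A minimal sketch of the Agent-as-a-Judge idea: rather than string-matching a gold answer, an LLM judge checks the agent's answer against task-specific rubric criteria. The rubric format and prompt here are illustrative assumptions, not Mind2Web 2's actual judge.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(task: str, rubric: list[str], answer: str) -> float:
    """Score an agentic-search answer as the fraction of rubric items met."""
    satisfied = 0
    for criterion in rubric:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": (
                f"Task: {task}\nAnswer: {answer}\nCriterion: {criterion}\n"
                "Does the answer satisfy this criterion? Reply YES or NO."
            )}],
        )
        if "YES" in resp.choices[0].message.content.upper():
            satisfied += 1
    return satisfied / len(rubric)
```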
Thanks for sharing our work!
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
🚨 New Paper Alert 🚨
We found that Supervised Fine-tuning on ONE problem can achieve a similar performance gain to RL on ONE problem, with 20x less compute!
Paper: arxiv.org/abs/2506.03295
Recently, people have shown that RL can work even with ONE example. This indicates that the…
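For intuition, one-problem SFT is just ordinary next-token cross-entropy on a single prompt/solution pair for a few gradient steps. A hedged sketch with Hugging Face transformers; the model choice, example, and step count are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # arbitrary small model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# ONE problem: a single prompt plus its worked solution.
text = "Problem: What is 17 * 24?\nSolution: 17 * 24 = 408."
batch = tok(text, return_tensors="pt")

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for step in range(20):  # a handful of gradient steps on the same example
    out = model(**batch, labels=batch["input_ids"])  # next-token CE loss
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```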
Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and…
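The core loop a framework like this has to support: generation interleaved with tool execution, with the resulting trajectory (tool outputs masked from the loss) handed to GRPO/PPO. All names below are illustrative, not VerlTool's API.

```python
import re

def parse_tool_call(segment: str) -> str | None:
    """Extract a call like <tool_call>search: query</tool_call>; None if absent.
    The tag format is an assumed convention, not VerlTool's."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", segment, re.S)
    return m.group(1) if m else None

def rollout(policy, env, question: str, max_turns: int = 4):
    """Interleave model generation with tool execution; return the trajectory.
    `policy.generate` and `env.run` are hypothetical duck-typed interfaces."""
    context, trajectory = question, []
    for _ in range(max_turns):
        segment = policy.generate(context)  # stops after a tool call or answer
        trajectory.append(("model", segment))
        call = parse_tool_call(segment)
        if call is None:          # no tool call: model gave a final answer
            break
        result = env.run(call)    # execute the tool (search, code exec, ...)
        trajectory.append(("tool", result))  # masked out of the RL loss
        context += segment + result
    return trajectory
```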
📷 New Benchmark Release: PhyX - Physical Reasoning for Multimodal Models
👉 Project Page: phyx-bench.github.io
👉 GitHub: github.com/NastyMarcus/Ph…
👉 arXiv: arxiv.org/abs/2505.15929
👉 Hugging Face Dataset: huggingface.co/datasets/Cloud…
Excited to share VideoEval-Pro, a robust and comprehensive evaluation suite for long video understanding (LVU) models.
📊 1,289 open-ended questions from 465 long videos (avg. 38 mins)
🎯 Diverse task types: perception and reasoning tasks based on local and holistic video content
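Open-ended evaluation needs an answer judge rather than option matching. A minimal sketch; the judge model and prompt are assumptions, not VideoEval-Pro's released evaluator.

```python
from openai import OpenAI

client = OpenAI()

def judge_open_ended(question: str, reference: str, prediction: str) -> bool:
    """Ask an LLM judge whether a free-form answer matches the reference."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Question: {question}\nReference answer: {reference}\n"
            f"Model answer: {prediction}\n"
            "Is the model answer correct? Reply CORRECT or WRONG."
        )}],
    )
    return "CORRECT" in resp.choices[0].message.content.upper()

# Accuracy over the suite = mean of judge_open_ended(...) across all questions.
```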
🚀 New Paper: Pixel Reasoner 🧠🖼️
How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself?
We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement…
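One way to picture the curiosity-driven part: a reward-shaping bonus that pays the policy for exercising pixel-space operations (zoom, crop, frame select) while it still uses them rarely. A schematic sketch, not the paper's exact reward:

```python
def shaped_reward(answer_correct: bool, used_pixel_ops: bool,
                  pixel_op_rate: float, target_rate: float = 0.3,
                  bonus: float = 0.2) -> float:
    """Illustrative curiosity-shaped reward (assumed form, not the paper's).

    answer_correct: whether the rollout reached the right answer
    used_pixel_ops: whether this rollout invoked zoom/crop/frame-select
    pixel_op_rate:  fraction of recent rollouts that used pixel ops
    """
    r = 1.0 if answer_correct else 0.0
    # Curiosity bonus: reward pixel-space operations while they are rare,
    # so the policy explores them instead of collapsing to text-only CoT.
    if used_pixel_ops and pixel_op_rate < target_rate:
        r += bonus * (1.0 - pixel_op_rate / target_rate)
    return r
```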
Introducing QuickVideo 🚀, speeding up the end-to-end time from the MP4 bitstream to VideoLLM inference by at least 2.5 times for hour-long video understanding (e.g., 1024 frames) on a single 40GB GPU.
🤔 What are the key challenges of hour-long video understanding?
1. …
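One ingredient behind that kind of speedup is overlapping CPU video decoding with GPU prefill instead of serializing them. A hedged sketch of the pipelining idea; the function names are placeholders, not QuickVideo's API:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(path: str, start: int, end: int):
    """Placeholder: decode frames [start, end) from the MP4 (e.g., via PyAV)."""
    ...

def prefill(frames) -> None:
    """Placeholder: run VideoLLM prefill on one chunk of frames."""
    ...

def pipelined_inference(path: str, n_frames: int = 1024, chunk: int = 128):
    starts = list(range(0, n_frames, chunk))
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start decoding the first chunk on a CPU thread.
        future = pool.submit(decode_chunk, path, 0, min(chunk, n_frames))
        for i, start in enumerate(starts):
            frames = future.result()      # wait for chunk i
            if i + 1 < len(starts):       # decode chunk i+1 in the background
                nxt = starts[i + 1]
                future = pool.submit(decode_chunk, path, nxt,
                                     min(nxt + chunk, n_frames))
            prefill(frames)               # GPU prefill hides the next decode
```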
🚀 General-Reasoner: Generalizing LLM Reasoning Across All Domains (Beyond Math)
Most recent RL/R1 works focus on math reasoning, but math-only tuning doesn’t generalize to general reasoning (e.g., performance drops on MMLU-Pro and SuperGPQA).
Why are we limited to math reasoning?
1. Existing…
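Outside math, answers usually can't be checked by exact-match rules, so a model-based verifier can supply the RL reward instead. A schematic sketch; the `verifier.equivalent` interface is an assumption, not General-Reasoner's released component:

```python
def general_domain_reward(question: str, pred: str, gold: str, verifier) -> float:
    """Reward for verifiable RL beyond math (illustrative sketch)."""
    # Cheap rule first: exact match after light normalization.
    if pred.strip().lower() == gold.strip().lower():
        return 1.0
    # Otherwise, ask a small generative verifier whether the prediction is
    # semantically equivalent to the reference answer (hypothetical interface).
    return 1.0 if verifier.equivalent(question, pred, gold) else 0.0
```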
🔥 How do you build a state-of-the-art Vision-Language Model with direct RL?
We’re excited to introduce VL-Rethinker, a new paradigm for multimodal reasoning trained directly with Reinforcement Learning.
📈 It sets new SOTA on key math+vision benchmarks:
- MathVista: 80.3 → 🥇…
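The tweet is truncated, but "direct RL" in this line of work typically means GRPO-style training: sample a group of responses per prompt and normalize rewards within the group to get advantages, with no learned value network. A minimal sketch of that advantage computation, as a generic illustration rather than the paper's exact recipe:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: each prompt's G sampled responses are
    normalized against their own group mean and std.

    rewards: (num_prompts, G) tensor, one scalar reward per response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each (1 = correct, 0 = wrong).
r = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(r))
```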
🚀 Big WebDreamer update! We train 💭Dreamer-7B, a small but strong world model for real-world web planning.
💥 Beats Qwen2-72B
⚖️ Matches #GPT-4o
Trained on 3M synthetic examples, and yes, all data + models are open-sourced.
❓ Wondering how to scale inference-time compute with advanced planning for language agents?
🙋‍♂️ Short answer: use your LLM as a world model.
💡 More detailed answer: using GPT-4o to predict the outcome of actions on a website can deliver strong performance with improved safety and…
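The mechanism in miniature: for each candidate action, ask the LLM to imagine the resulting page, score the imagined state against the goal, and only commit the best action in the real environment. A hedged sketch; the prompts and the 0-10 scoring scheme are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def plan_step(goal: str, page: str, candidate_actions: list[str]) -> str:
    """Model-predictive planning with an LLM world model: simulate, score, act."""
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        # World model: imagine the page state after taking this action.
        imagined = ask(f"Current page:\n{page}\n\nIf the user performs "
                       f"'{action}', describe the resulting page.")
        # Critic: rate progress toward the goal (assumes a numeric reply).
        score = float(ask(f"Goal: {goal}\nPredicted next page: {imagined}\n"
                          "Rate progress toward the goal from 0 to 10. "
                          "Reply with only the number."))
        if score > best_score:
            best_action, best_score = action, score
    return best_action  # only the winning action is executed for real
```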