Tianyu Zheng
@zhengtianyu4
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…
Achieving AGI? Define it first! Thanks for sharing our work!!!
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
[1/n] 🎉We are very pleased to introduce FineFineWeb, currently the largest open-source, fully automated fine-grained classification effort for web data. Specifically, our contributions are as follows: 🔪We decompose the entire deduplicated version of FineWeb into 67…
🚀 Introducing MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale We’re excited to open-source: - a 12M multimodal instruction-tuning dataset - MAmmoTH-VL-8B, a SoTA VL model (~10B size) on 20+ downstream tasks compared with fully open-source baselines such as…
[1/n] 🔥 Happy to introduce FullStack Bench: a comprehensive evaluation dataset focusing on full-stack programming across 16 languages and more than 11 real-world application domains such as data analysis, software engineering, and machine learning. Whether or not your CodeLLM is…
[1/n] ### All Current MLLMs Cannot Play Easy Vision-based Games Yet 🚀 Introducing ING-VP — the first Interactive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. A question really…
Glad to introduce a new M-A-P project: MAP-Daily-Paper: m-a-p.ai/DailyPaper/ GitHub: github.com/multimodal-art… Our hard-working members (@zhengtianyu4 @wangchunshu @Dudodododo @xiaoy11441629 @KingZhu0210 @yizhilll @MingyangLi71450) from the M-A-P community now select papers as…
[1/n] ### Discover AutoKaggle: Revolutionizing Data Science Competitions with Multi-Agent Collaboration! 🚀 Introducing AutoKaggle — a multi-agent framework designed to automate the full spectrum of data science competitions on Kaggle! From background understanding to model…
1/n Excited to announce the release of our new paper "A Comparative Study on Reasoning Patterns of OpenAI's o1 Model" arxiv.org/pdf/2410.13639 To advance the open-source community's reproduction of the o1 model, we collaborated with the OpenO1 Team (github.com/Open-Source-O1…).
[1/n] Are LLMs capable of self-alignment with nearly nothing? (Only a few instructions, without responses) We revisit this important research question by introducing I-SHEEP! LLMs, like humans, have the potential to achieve self-alignment with very weak supervision. Self-Assessment…
[1/n] New Benchmark Alert! LongIns (arxiv.org/pdf/2406.17588) is a little "brother" of LongICLBench (arxiv.org/pdf/2404.02060), but it enables more dynamic verification of LLMs' long-context reasoning performance. Each sample in LongIns is composed of multiple…