Tianyu Zheng
@zhengtianyu4
People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true? In our study (arxiv.org/pdf/2507.00432), we…
Achieving AGI? Define it first! Thanks for sharing our work!!!
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
[1/n] 🎉We are very pleased to introduce FineFineWeb, currently the largest open-source, fully automated fine-grained classification effort for web data. Specifically, our contributions are as follows: 🔪We decompose the entire deduplicated version of FineWeb into 67…
🚀 Introducing MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale We’re excited to open-source: - a 12M multimodal instruction-tuning dataset - MAmmoTH-VL-8B, a SoTA VL model (~10B size) on 20+ downstream tasks compared with fully open-source baselines such as…
[1/n] 🔥 Happy to introduce FullStack Bench: a comprehensive evaluation dataset focusing on full-stack programming across 16 languages and more than 11 real-world application domains such as data analysis, software engineering, and machine learning. Whether or not your CodeLLM is…
[1/n] ### All Current MLLMs Cannot Play Easy Vision-based Games Yet 🚀 Introducing ING-VP — the first Interactive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. A question really…
Glad to introduce a new M-A-P project: MAP-Daily-Paper: m-a-p.ai/DailyPaper/ GitHub: github.com/multimodal-art… Our hard-working members (@zhengtianyu4 @wangchunshu @Dudodododo @xiaoy11441629 @KingZhu0210 @yizhilll @MingyangLi71450) from the M-A-P community now select papers as…
[1/n] ### Discover AutoKaggle: Revolutionizing Data Science Competitions with Multi-Agent Collaboration! 🚀 Introducing AutoKaggle — a multi-agent framework designed to automate the full spectrum of data science competitions on Kaggle! From background understanding to model…
1/n Excited to announce the release of our new paper "A Comparative Study on Reasoning Patterns of OpenAI's o1 Model" arxiv.org/pdf/2410.13639 To advance the open-source community's reproduction of the o1 model, we collaborated with the OpenO1 Team (github.com/Open-Source-O1…).
[1/n] Are LLMs capable of self-alignment with nearly nothing? (Only a few instructions, without responses) We revisit this important research question by introducing I-SHEEP! LLMs, like humans, have the potential to achieve self-alignment with very weak supervision. Self-Assessment…
[1/n] New Benchmark Alert! LongIns (arxiv.org/pdf/2406.17588) is a little "brother" of LongICLBench (arxiv.org/pdf/2404.02060), but it enables more dynamic verification of LLMs' long-context reasoning performance. Each sample in LongIns is composed of multiple…