Yusen Zhang
@YusenZhangNLP
PhD Candidate @PennStateEECS | NLP Lab @NLP_PennState #NLProc | Prev Research Intern @MSFTResearch, @AmazonScience @GoogleAI
🚀 How Far Are VLMs from Effective High-Resolution Image Understanding? 👉 We found: Still far. 🆕 Introducing HRScene Benchmark: 📸 25 Real-world Scenes + 🧪 2 Diagnostic NIAH Tests 🏙️ 8 Categories: Daily, Paper, Urban Planning, etc. 🖼️ Resolution: 1,024 × 1,024 ➡️ 35,503 ×…


Our paper VisOnlyQA has been accepted to @COLM_conf #COLM2025! See you in Montreal🍁 We find that even recent Vision Language Models struggle with simple questions about geometric properties in images, such as "What is the degree of angle AOD?"🧐 arxiv.org/abs/2412.00947
📢 New preprint! Do LVLMs have strong visual perception capabilities? Not quite yet... We introduce VisOnlyQA, a new dataset for evaluating the visual perception of LVLMs, and find that existing LVLMs perform poorly on it. [1/n] arxiv.org/abs/2412.00947 github.com/psunlpgroup/Vi…
We updated our VisOnlyQA paper for #COLM2025! * LVLMs exhibit weak geometric perception even on geometric shapes with 2–3 lines 😭 * Gemini 2.5 Pro largely improves over prior models on charts and chemistry 😳 but still struggles with geometric shapes 😖 arxiv.org/abs/2412.00947
HRScene got accepted at #ICCV2025! HRScene is a novel unified benchmark for high-resolution image understanding with 25 scenes and 2 NIAH tests. Home page: yszh8.github.io/hrscene/ (Sorry, the EvalAI submission portal is currently down...) My PhD research began with long text…
Vision Language Models display a peculiar blind spot: their ability to process image content follows a U-shaped pattern in the Manhattan distance from the image corners, suggesting fundamental limitations in handling high-resolution layouts.
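A minimal sketch (mine, not the paper's code) of how such a diagnostic could be computed, assuming a needle-in-a-haystack placement grid where each cell has a measured accuracy:

```python
from collections import defaultdict

def manhattan_to_nearest_corner(row: int, col: int, n_rows: int, n_cols: int) -> int:
    """Manhattan distance from cell (row, col) to the closest of the 4 corners."""
    corners = [(0, 0), (0, n_cols - 1), (n_rows - 1, 0), (n_rows - 1, n_cols - 1)]
    return min(abs(row - r) + abs(col - c) for r, c in corners)

def accuracy_by_corner_distance(cell_accuracy, n_rows=8, n_cols=8):
    """Average per-cell accuracy, binned by distance to the nearest corner.

    `cell_accuracy` maps (row, col) -> accuracy; an 8x8 grid is an assumption.
    Plotting the returned dict reveals whether accuracy is U-shaped in distance.
    """
    bins = defaultdict(list)
    for (row, col), acc in cell_accuracy.items():
        bins[manhattan_to_nearest_corner(row, col, n_rows, n_cols)].append(acc)
    return {d: sum(v) / len(v) for d, v in sorted(bins.items())}
```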
NeuroGen: We explored a training-free idea—using prompts to guide large models to generate neural net parameters for downstream tasks. It may ease hyperparam sensitivity & data dependence. Still early, feedback welcome! 📄 arxiv.org/pdf/2505.12470 #FoundationModels #LLM
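A toy sketch of the general idea, with heavy caveats: the prompt, the JSON parsing, and the `llm_complete` callable below are all illustrative assumptions, not the paper's actual pipeline:

```python
import json
import numpy as np

def generate_linear_head(llm_complete, n_features: int, n_classes: int) -> np.ndarray:
    """Ask an LLM (via a user-supplied `llm_complete` callable) to emit the
    weights of a small linear classification head, then parse and load them.
    No gradient updates are involved, hence 'training-free'."""
    prompt = (
        f"Return only a JSON list of {n_classes} rows of {n_features} floats, "
        "to be used as the weight matrix of a linear sentiment classifier."
    )
    weights = np.array(json.loads(llm_complete(prompt)), dtype=np.float32)
    assert weights.shape == (n_classes, n_features)
    return weights
```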
📢 New paper! FoVer enhances PRMs for step-level verification of LLM reasoning w/o human annotation 🚀 We synthesize training data using formal verification tools and improve LLMs at step-level verification of LLM responses on MATH, AIME, MMLU, BBH, etc. arxiv.org/abs/2505.15960
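A hedged sketch of what annotation-free step-level data synthesis could look like; the `verify_step` callable stands in for a formal verification tool, and all field names are illustrative assumptions rather than the paper's format:

```python
from dataclasses import dataclass

@dataclass
class StepLabel:
    step_text: str
    is_correct: bool  # label produced automatically by a formal verifier, not a human

def synthesize_example(problem: str, steps: list[str], verify_step) -> dict:
    """Attach an automatic correctness label to every reasoning step.

    `verify_step(problem, steps_so_far, step)` is assumed to wrap a formal
    verification tool and return a bool; the resulting records can train a
    process reward model (PRM) for step-level verification.
    """
    labels, so_far = [], []
    for step in steps:
        labels.append(StepLabel(step, verify_step(problem, so_far, step)))
        so_far.append(step)
    return {"problem": problem, "steps": [vars(s) for s in labels]}
```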
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding? "we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from…
Because Vision Language Models treat images as tokens, high-resolution images produce very long sequences, mirroring the long-context challenges in LLMs. In this new paper, we release a benchmark to test VLMs' ability to understand high-resolution images, up to hundreds of millions of…
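A quick back-of-the-envelope illustration of why this resembles long context, assuming naive ViT-style 14×14 patchification with no tiling or downsampling (real VLMs vary):

```python
def approx_image_tokens(height: int, width: int, patch: int = 14) -> int:
    """Patch-token count for an image under naive patchification."""
    return (height // patch) * (width // patch)

print(approx_image_tokens(1024, 1024))      # 5,329 tokens -> already a long sequence
print(approx_image_tokens(16_384, 16_384))  # 1,368,900 tokens (hypothetical size)
```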
Ever wondered how much you can trust a benchmark? We did too - so we built SMART to make them smarter! I will be presenting our work at NAACL along with my coauthors @megnung and @DavidPantoja__. See you all in New Mexico!
🚨 New paper alert 🚨 Ever struggled with quick saturation or unreliability in benchmark datasets? Introducing SMART Filtering to select high-quality examples, reducing dataset size by 48% on avg (up to 68% for ARC!) and improving correlation with scores from ChatBot Arena! 📈✨ (1/N)
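A generic sketch in the spirit of benchmark filtering; the specific criteria, thresholds, and callables below are my assumptions, not the paper's exact method:

```python
def smart_style_filter(examples, model_correct, sim, easy_frac=1.0, sim_thresh=0.9):
    """Keep a smaller, more informative subset of a benchmark.

    Drops (1) examples that all reference models already answer correctly
    (too easy to discriminate) and (2) examples nearly identical to one
    already kept (redundant). `model_correct(model, ex)` and `sim(a, b)`
    are user-supplied; the model names below are placeholders.
    """
    kept = []
    for ex in examples:
        votes = [model_correct(m, ex) for m in ("model_a", "model_b", "model_c")]
        if sum(votes) / len(votes) >= easy_frac:         # every model solves it
            continue
        if any(sim(ex, k) >= sim_thresh for k in kept):  # near-duplicate
            continue
        kept.append(ex)
    return kept
```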
Want to learn about fairness in summarization? @HaoyuanLi9 will present our work on fairness in multidocument summarization at #NAACL25 in a couple of hours. This was in collaboration with @YusenZhangNLP and @ruizhang_nlp
Happy to announce that our paper received an SAC award for summarization at @naaclmeeting. I will present the paper in Session H: Oral/Poster 5, Thursday, May 1, 14:00-15:30.
This work is led by my first PhD student Yusen @YusenZhangNLP, who is graduating soon and actively seeking a postdoc position in academia. It has been an absolute pleasure to work with him. Please consider hiring him!
🎉Our paper on fairness of multidoc summarization has received an SAC award at NAACL 2025! 🥳 We appreciate the recognition from senior area chairs. @HaoyuanLi9 and @YusenZhangNLP will present our work: Posters (Exhibit Hall), Session H: Oral/Poster 5, Thursday May 1,…
🟢 Announcing the #NAACL2025 Award Winners! The Best Paper and Best Theme Paper winners will present at our closing session 2025.naacl.org/blog/best-pape…
I will be at NAACL this week. Feel free to find me if you'd like to discuss this project or any other research topic!