Haodong Duan
@KennyUTC
Researcher @ Shanghai AI Lab, working on multi-modal learning. B.S. @PKU1898 / Ph.D. @CUHKofficial. Built #VLMEvalKit for MLLM evaluation.
OpenCompass just released RISEBench, the first benchmark on Reasoning-Informed Visual Editing (RISE). GPT-4o Image Generation only scores 36% on this challenging task! Technical Report: huggingface.co/papers/2504.02… #GPT4o
Families may have been divided, but the world is united.
Just created a Gallery to display all generation results on RISEBench (by powerful models including GPT-4o Image, Gemini-2.0, Bagel, etc.). Please contact me if you want the results of your new model to be included! Tech Report: arxiv.org/abs/2504.02826
- VisualPRM for Test-Time Scaling of Visual Reasoning Problems: arxiv.org/abs/2503.10291
- 5%~10% avg. accuracy improvement over 7 mainstream benchmarks
- Released with 400K tuning data samples & 3K benchmark problems
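Test-time scaling with a process reward model usually means sampling several candidate reasoning chains and letting the PRM pick the best one. Below is a minimal Best-of-N sketch in Python; `sample_solution` and `prm_score_steps` are hypothetical stand-ins for the policy MLLM and VisualPRM, not the paper's actual API.

```python
import random

def sample_solution(problem: str, seed: int) -> list[str]:
    # Hypothetical stand-in: an MLLM would sample a step-by-step
    # reasoning chain for the (image, question) pair here.
    random.seed(seed)
    return [f"step {i}: ..." for i in range(random.randint(2, 5))]

def prm_score_steps(problem: str, steps: list[str]) -> list[float]:
    # Hypothetical stand-in: a process reward model scores each
    # intermediate step, not just the final answer.
    return [random.random() for _ in steps]

def best_of_n(problem: str, n: int = 8) -> list[str]:
    """Sample n candidate chains, keep the one whose steps
    the PRM rates highest on average."""
    candidates = [sample_solution(problem, seed) for seed in range(n)]

    def chain_score(steps: list[str]) -> float:
        scores = prm_score_steps(problem, steps)
        return sum(scores) / len(scores)

    return max(candidates, key=chain_score)

print(best_of_n("What is the area shaded in the figure?"))
```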

We just added SmolVLM2 to VLMEvalKit - now it is easier to evaluate your fine-tunes 🥰😊
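For reference, a minimal sketch of querying a model through VLMEvalKit's Python API. The registry key `'SmolVLM2'` is my assumption; list `supported_VLM.keys()` to find the identifier the library actually registers for your fine-tune.

```python
from vlmeval.config import supported_VLM

# 'SmolVLM2' is an assumed registry key; inspect supported_VLM.keys()
# to find the exact name VLMEvalKit uses for your checkpoint.
model = supported_VLM['SmolVLM2']()

# Single-sample inference: an image path followed by a question.
response = model.generate(['demo.jpg', 'What is in this image?'])
print(response)
```

Full-benchmark runs go through the repo's `run.py` instead, along the lines of `python run.py --data <benchmark> --model <key>`.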
We are analyzing the top papers on @huggingface (~4,000 papers, mostly related to LLMs), and here is a list of the top 20 authors with the most papers published in under 2 years. All of them are Asian (though not necessarily in Asia). This is no competition; these alphas OWN the game.
Lame
DeepSeek is a wake-up call for America, but it doesn't change the strategy:
- The USA must out-innovate & race faster, as we have done in the entire history of AI
- Tighten export controls on chips so that we can maintain future leads
Every major breakthrough in AI has been American
After 1 year of building, VLMEvalKit now reaches 100+ contributors. On the journey of exploring LMM capabilities, we will go further: github.com/open-compass/V…

OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multi-modal math reasoning benchmarks. Currently, Gemini-2.0-Flash takes 1st place. DM me to suggest more benchmarks and models for this leaderboard.

Real Research :lol
As my kids are singing APT non-stop these days, I did a bit of reverse engineering of the APT music video and tried to understand why the MV is so addictive. Here is what I learned.
Mitigating racial bias in LLMs is a lot easier than removing it from humans! Can’t believe this happened at the best AI conference, @NeurIPSConf. We have ethical reviews for authors, but missed them for invited speakers? 😡
I HATE this website
A question for everyone: how can I permanently block "CSDN Blog" results from appearing in Google search pages? They are driving me crazy.
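One workaround (my own suggestion, not from the thread): Google's `-site:` operator excludes a domain per query, and for a truly permanent fix a blocklist extension such as uBlacklist with a rule like `*://*.csdn.net/*` does it browser-wide. A small Python sketch of the per-query version:

```python
from urllib.parse import quote_plus

def google_search_url(query: str, exclude_domain: str = "blog.csdn.net") -> str:
    # The -site: operator tells Google to drop results from that domain.
    return "https://www.google.com/search?q=" + quote_plus(f"{query} -site:{exclude_domain}")

print(google_search_url("pytorch dataloader stuck"))
# https://www.google.com/search?q=pytorch+dataloader+stuck+-site%3Ablog.csdn.net
```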
WILD! Some researchers have republished ResNet under their own names in a predatory journal. @CVPR
The figure is very "apple"
Apple released AIMv2 🍏 a family of state-of-the-art open-set vision encoders
> like CLIP, but with a decoder added and trained on autoregression 🤯
> 19 open models in 300M, 600M, 1.2B, and 2.7B sizes, with resolutions of 224, 336, and 448
> loadable and usable with 🤗 transformers
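A minimal sketch of loading one of the checkpoints with 🤗 transformers to extract image features. The repo id and processor class are my assumptions based on the release naming; verify against the model card before use.

```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

model_id = "apple/aimv2-large-patch14-224"  # assumed repo id; check the AIMv2 model cards
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.jpg")  # any local image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state  # patch-level visual features (assuming the usual output layout)
print(features.shape)
```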
Ovis demonstrates impressive performance on the Open VLM Leaderboard among lightweight (<10B) VLMs. You can check the results here: huggingface.co/spaces/opencom…
Curious about the story behind OVIS? They're doing a broadcast now. Check it out and ask questions :) x.com/i/broadcasts/1…