Wenhao Chai
@wenhaocha1
Ph.D. Student @PrincetonCS. Prev @UW @Stanford @pika_labs @MSFTResearch @UofIllinois @ZJU_China. I work on computer vision, but it's not all I do.
This appears to be a well-defined and good problem. Take a look!
We're also introducing a new interpretability track (more details soon) and two guest tracks:
1. KiVA image understanding: like ARC-AGI but grounded in cog sci w/ difficulty levels
2. Physics-IQ video generation: can your img2video model generate physically plausible scenes?
Go LONG VIDEO! Our MovieChat in early 2023 built just a very naive prototype for memory-augmented long-video context understanding. Super excited to see it come true at scale and in real applications. github.com/rese1f/MovieCh…
I’m Shawn, founder of Memories.ai, former researcher at Meta and CS PhD at University of Cambridge. Today we’re launching: we built the world’s first Large Visual Memory Model - to give AI human-like visual memories. Why visual memory? AI to…
Dataset Distillation as Data Compression: A Rate-Utility Perspective arxiv.org/abs/2507.17221 Read this paper tonight and it gave me some sense: Dataset Distillation ≈ Visual Tokenization? Dataset Distillation: replace the full dataset with a few synthetic samples. Visual Tokenizer: Replace…
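The rate-utility framing, as I read it (my notation, not necessarily the paper's exact objective): pick synthetic samples that keep downstream utility high while paying a bit cost for storing them.

```latex
% Hedged sketch: S = synthetic set, U = downstream utility (e.g., accuracy of a model
% trained on S), R = bits needed to encode S, lambda trades the two off.
\min_{\mathcal{S}} \; -\,U\!\big(\theta^{*}(\mathcal{S})\big) \;+\; \lambda\, R(\mathcal{S}),
\qquad \theta^{*}(\mathcal{S}) = \arg\min_{\theta} \mathcal{L}(\theta;\, \mathcal{S})
```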
This amazing team from Kuaishou did a great job on LiveCodeBench Pro: a 40B model that almost matches o3-mini performance. Take a look at their tech report! Leaderboard: livecodebenchpro.com
🚀 Excited to introduce KAT-V1 (Kwaipilot-AutoThink) – a breakthrough 40B large language model from the Kwaipilot team! KAT-V1 dynamically switches between reasoning and non-reasoning modes to address the “overthinking” problem in complex reasoning tasks. Key Highlights: 📌 40B…
arxiv.org/abs/2507.13338 This is a really great paper that connects multiple concepts like spectral norm, residual connections, layer norm, and other techniques under the Lipschitz condition. Truly well-written and easy to follow.
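The basic bookkeeping behind this kind of analysis (standard bounds, sketched here in PyTorch as my own illustration, not the paper's certification procedure): the Lipschitz constant of a linear layer is its spectral norm, bounds multiply under composition, and a residual connection adds 1.

```python
import torch

def spectral_norm(weight: torch.Tensor) -> float:
    """Lipschitz constant of x -> weight @ x, i.e. the largest singular value."""
    return torch.linalg.matrix_norm(weight, ord=2).item()

# Standard composition rules (illustrative only):
#   Lip(f . g)    <= Lip(f) * Lip(g)
#   Lip(x + g(x)) <= 1 + Lip(g)     (residual connection)
#   Lip(ReLU)      = 1
W1 = torch.randn(256, 256) / 16.0
W2 = torch.randn(256, 256) / 16.0

lip_mlp = spectral_norm(W2) * 1.0 * spectral_norm(W1)  # linear -> ReLU -> linear
lip_res = 1.0 + lip_mlp                                 # same block wrapped in a skip
print(f"MLP bound: {lip_mlp:.3f}, residual bound: {lip_res:.3f}")
```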
From formal language to natural language, and yet it still made remarkable progress! That's far better than what I could do now (or in the past); I've been out of math competitions for ages. So the question is: is natural language actually better than Lean, or are they just trying to build…
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
Workshop Highlights from @CVPR
LOVE @CVPR'25 Challenge wrapped up with incredible participation 🎉
🔹 Academic talks from leading researchers
🔹 Winners crowned in both Track 1A & 1B
🔹 Prizes awarded by @LambdaAPI
Reports from winners are now live! Check it out 🔗 Track…
We present DreamOn: a simple yet effective method for variable-length generation in diffusion language models. Our approach boosts code infilling performance significantly and even catches up with oracle results.
🚀 Thrilled to announce Dream-Coder 7B — the most powerful open diffusion code LLM to date.
We should also turn our attention to the Dream series — from an amazing research group that's steadily building the foundation for dLLMs.
What happened after Dream 7B? First, Dream-Coder 7B: a fully open diffusion LLM for code delivering strong performance, trained exclusively on public data. Plus, DreamOn cracks the variable-length generation problem! It enables code infilling that goes beyond a fixed canvas.
Great work; I always like controlled (even just toy) experiments. I'm afraid we can't get OOD generalization in current overparameterized ML systems. I've recently grown fond of pattern/concept narratives. In fact, I don't believe that data-driven neural networks are capable of…
Our paper aims to answer two questions:
1. What's the difference between prediction and world models?
2. Are there straightforward metrics that can test this distinction?
Our paper is about AI. But it's helpful to go back 400 years to answer these questions.
Single-pass Adaptive Image Tokenization for Minimum Program Search arxiv.org/abs/2507.07995 I find this paper super impressive, and their team has been working on meaningful vision tokenizers for a long time! What I take from this paper: not all images have the same complexity, we…
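The variable-budget idea, as I understand it (an illustrative loop only; the paper does this in a single pass, and `encoder`/`decoder` below are hypothetical callables): stop spending tokens once the image is already reconstructed well enough, so simple images get short codes.

```python
import torch

def adaptive_tokenize(image, encoder, decoder, max_tokens=256, tol=1e-2):
    """Illustrative sketch, not the paper's algorithm: grow the token budget until
    reconstruction error drops below `tol`. Complex images end up with more tokens."""
    tokens = None
    for k in range(1, max_tokens + 1):
        tokens = encoder(image, num_tokens=k)   # hypothetical: encode with k tokens
        recon = decoder(tokens)                 # hypothetical: decode back to pixels
        err = torch.mean((recon - image) ** 2).item()
        if err < tol:
            break                               # enough tokens for this image
    return tokens
```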

As an amateur photographer, I can’t wait to see more agents help with retouching and postprocessing!
🤨Ever dream of a tool that can magically restore and upscale any (low-res) photo to crystal-clear 4K? 🔥Introducing "4KAgent: Agentic Any Image to 4K Super-Resolution", the most capable upscaling generalist designed to handle broad image types. 🔗4kagent.github.io 1/🧵
📣 Excited to announce SpaVLE: #NeurIPS2025 Workshop on Space in Vision, Language, and Embodied AI! 👉 …vision-language-embodied-ai.github.io 🦾Co-organized with an incredible team → @fredahshi · @maojiayuan · @DJiafei · @ManlingLi_ · David Hsu · @Kordjamshidi 🌌 Why Space & SpaVLE? We…
Very good ckpt hub for hybrid design research. Thanks!!
Hybrid architectures mix linear & full attention in LLMs. But which linear attention is best? This choice has been mostly guesswork. In our new work, we stop guessing. We trained and open-sourced 72 MODELS (340M & 1.3B) to dissect what truly makes a hybrid model tick🧶
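For context, the two attention families these hybrids mix, roughly (standard textbook formulations, not specific to this work):

```latex
% Full (softmax) attention: O(n^2) pairwise interactions
y_i \;=\; \sum_{j} \mathrm{softmax}_j\!\left(\tfrac{\langle q_i, k_j\rangle}{\sqrt{d}}\right) v_j
% Linear attention: a kernel feature map \phi lets the sums over j be precomputed, O(n)
y_i \;=\; \frac{\phi(q_i)^{\top} \sum_{j} \phi(k_j)\, v_j^{\top}}{\phi(q_i)^{\top} \sum_{j} \phi(k_j)}
```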
Great work on hybrid model design for reasoning tasks! Compared to natural language, the reasoning process is far denser and more informative, which makes it more challenging and meaningful.
Reasoning can be made much, much faster—with fundamental changes in neural architecture. 😮 Introducing Phi4-mini-Flash-Reasoning: a 3.8B model that surpasses Phi4-mini-Reasoning on major reasoning tasks (AIME24/25, MATH500, GPQA-D), while delivering up to 10× higher throughput…
I’ve been thinking about whether it’s possible to charge a submission fee for each paper, which would be refunded once the author successfully completes their reviewing duties. Any surplus could be used to reward outstanding reviewers. This would not only incentivize a better…
Some say reviewing should be voluntary, so authors shouldn't be obligated to review. But authors also receive reviews as a free service—so we should give back, especially given the growing number of submissions. I support requiring authors to review or opt for a buyout (e.g.,…
Just created a Gallery to display all generation results on RISEBench (by powerful models including GPT-4o Image, Gemini-2.0, Bagel, etc.). Please contact me if you want the results of your new model to be included! Tech Report: arxiv.org/abs/2504.02826
OpenCompass just released RISEBench, the first benchmark on Reasoning-Informed Visual Editing (RISE). GPT-4o Image Generation only scores 36% on this challenging task! Technical Report: huggingface.co/papers/2504.02… #GPT4o
I believe if we view the transformer (xsfm) as associative memory, then next-token prediction is a single-step energy model; if it's UT or depth-recurrent, it's a multi-step energy model. Another interesting thing is that when we do multi-step like UT, it's very hard to optimize via…
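To spell out what I mean (my own framing, hedged, not a claim about the paper's exact formulation):

```latex
% Single-step: one forward pass stands in for the energy minimizer
y \;\approx\; f_\theta(x) \;\approx\; \arg\min_{y'} E_\theta(x, y')
% Multi-step (UT / depth-recurrent / EBT-style): refine a candidate over T steps
y_{t+1} \;=\; y_t \;-\; \eta\, \nabla_{y} E_\theta(x, y_t), \qquad t = 0, \dots, T-1
```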
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the…
Find this paper really insightful. arxiv.org/abs/2507.02754 By definition, a 0-simplex is a linear model and a 1-simplex is the xsfm. An N-simplex means that, in the operator, each token relates to a group of N other tokens. I found AlphaFold uses a 1- and 2-simplex hybrid for its data. So I really…
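Roughly what the simplex hierarchy looks like in attention-logit terms (my reading, details hedged; the paper's exact parameterization may differ):

```latex
% 1-simplex: ordinary pairwise attention logits between a query token i and a key token j
A_{ij} \;=\; \frac{\langle q_i, k_j \rangle}{\sqrt{d}}
% 2-simplex: trilinear logits between a query token i and a pair of key tokens (j, k)
A_{ijk} \;=\; \frac{\sum_{c} q_{i,c}\, k_{j,c}\, k'_{k,c}}{\sqrt{d}}
```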