Kwang Moo Yi
@kwangmoo_yi
Assistant Professor of Computer Science at the University of British Columbia. I also post my daily finds on arXiv.
Paper of (not) today: Violante et al., "Splat and Replace: 3D Reconstruction with Repetitive Elements" -- repo-sam.inria.fr/nerphys/splat-… Various human-made scenes are repetitive -- you can use this to fake multiple views. Reminds me of fractal-based ideas from the past!
Preprint of (not) today: Bohacek and Fel et al., "Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders" -- arxiv.org/abs/2506.19708 What are some things that text-to-image generators cannot generate? An interesting systematic way to look into it.

Preprint of (not) today: Lin and Lin et al., "MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second" -- chenguolin.github.io/projects/MoVie… Feed-forward VGGT + Splats/Motion estimation heads, trained also with rendering & motion estimation losses. Multitask training improves all.
Preprint of today: Walker et al., "Generalist Forecasting with Frozen Video Models via Latent Diffusion" -- arxiv.org/abs/2507.13942 Maybe not surprising, but also very interesting -- learning to forecast strongly correlates with generalization. Like how LLMs came to be.

Preprint of (not) today: Sun et al., "3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds" -- ai.stanford.edu/~sunfanyun/3d-… An agentic VLM-based pipeline to generate 3D scenes (with Blender and assets) that adhere to prompts.
Paper of today: Xiao et al., "SpatialTrackerV2: 3D Point Tracking Made Easy" -- spatialtracker.github.io Learned modules dedicated to video depth and camera pose, plus a refinement module that uses 2D/3D tracks. Joint refinement seems to help, but it's hard to conclude given the method's complexity.
Preprint of today: Zhuo and Zheng et al., "Streaming 4D Visual Geometry Transformer" -- wzzheng.net/StreamVGGT/ VGGT with cache / causal attention for 70ms inference on an image stream. Similar to other Dust3R speed-up methods, but with VGGT.
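If it helps, here's roughly how I picture the caching trick -- a toy numpy sketch of causal attention with a key/value cache over a frame stream. This is my own simplification, not StreamVGGT's actual code:

```python
import numpy as np

class StreamingAttention:
    """Toy causal attention with a key/value cache over an image stream."""
    def __init__(self, dim):
        self.dim = dim
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def step(self, q, k, v):
        # q, k, v: (tokens_in_new_frame, dim). Past frames are only re-used via the
        # cache, never re-encoded -- that is where the streaming speed-up comes from.
        self.k_cache = np.vstack([self.k_cache, k])
        self.v_cache = np.vstack([self.v_cache, v])
        logits = q @ self.k_cache.T / np.sqrt(self.dim)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        return attn @ self.v_cache
```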

Preprint of today: Li and Yi and Liu et al., "Cameras as Relative Positional Encoding" -- liruilong.cn/prope/ Conditioning on cameras (and rays) with Plücker raymaps has been a breakthrough for enabling 3D in transformer-based models. How you encode this matters a lot.
For everyone interested in precise 📷camera control 📷 in transformers [e.g., video / world model etc] Stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: liruilong.cn/prope/
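The gist of "relative, not absolute" camera conditioning, as I read it -- a toy numpy sketch where the attention logits only ever see the relative pose between the query token's camera and the key token's camera. This is my simplification of the idea, not the paper's actual PRoPE formulation:

```python
import numpy as np

def relative_pose(T_query_c2w, T_key_c2w):
    # Transform taking key-camera coordinates into the query camera's frame.
    return np.linalg.inv(T_query_c2w) @ T_key_c2w

def camera_relative_logits(q_feat, k_feat, q_dirs, k_dirs, T_q, T_k):
    # q_feat/k_feat: (N, d)/(M, d) token features; q_dirs/k_dirs: (N, 3)/(M, 3)
    # per-token ray directions, each expressed in its *own* camera's frame.
    R_rel = relative_pose(T_q, T_k)[:3, :3]
    k_dirs_in_q = k_dirs @ R_rel.T            # rotate key rays into the query frame
    content = q_feat @ k_feat.T               # standard content term
    geometry = q_dirs @ k_dirs_in_q.T         # term that only sees relative geometry
    return (content + geometry) / np.sqrt(q_feat.shape[1])
```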
Preprint of today: Wang et al., "CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering" -- clift-nvs.github.io Transformer-based novel view synthesis (the typical Plücker-based setup) + tokenization -- or, the way I'd put it, keypoints (anchors)!
Preprint of today: Tuli and Kamali and Lindell, "Generative Panoramic Image Stitching" -- arxiv.org/abs/2507.07133 Harmonizing and smoothing images is non-trivial, so let's use a diffusion model for it -- in/out-paint to do the panoramic stitching.

Preprint of today: Hu et al., "Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration" -- arxiv.org/abs/2507.05604 Even with sampling methods, all we often want is the mode. So let's solve inverse problems with KDE w/ Diffusion Models?
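My toy reading of the mode-seeking step -- run the sampler several times, then mean-shift toward the mode of a kernel density estimate over the candidates. Names, bandwidth, and the stand-in data are mine, not the paper's algorithm:

```python
import numpy as np

def kde_mode_seek(samples, x0, bandwidth=0.5, steps=20):
    # samples: (N, D) candidate restorations from repeated diffusion runs; x0: (D,) start.
    x = x0.copy()
    for _ in range(steps):
        d2 = np.sum((samples - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)          # Gaussian kernel weights
        x = (w[:, None] * samples).sum(0) / w.sum()     # mean-shift step toward the mode
    return x

candidates = np.random.randn(64, 16) + 3.0              # stand-in for diffusion outputs
mode_estimate = kde_mode_seek(candidates, candidates.mean(0))
```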

Kwang Moo Yi @kwangmoo_yi x.com/MardaniMorteza…
📢 Test-time Scaling of SDE Diffusion Models Does optimizing the noise trajectory improve sample quality? Significantly. We propose ϵ-greedy search, a simple contextual bandit method matching optimal MCTS in noise space. 📄 arxiv.org/pdf/2506.03164 💻 github.com/rvignav/diffus…
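A minimal sketch of how I understand the ϵ-greedy idea -- at each sampling step, propose a few candidate injected noises, usually keep the one a verifier scores best, and occasionally explore a random one. The denoise/verifier callables below are placeholders of mine, not their code:

```python
import numpy as np

def eps_greedy_step(x, denoise, verifier, eps=0.1, n_candidates=4, sigma=1.0):
    # One SDE-style step: try a few candidate noises, then pick epsilon-greedily.
    noises = [sigma * np.random.randn(*x.shape) for _ in range(n_candidates)]
    candidates = [denoise(x) + z for z in noises]
    if np.random.rand() < eps:                             # explore
        return candidates[np.random.randint(n_candidates)]
    scores = [verifier(c) for c in candidates]             # exploit the best-scoring noise
    return candidates[int(np.argmax(scores))]

x = np.random.randn(8)
x = eps_greedy_step(x, denoise=lambda v: 0.9 * v, verifier=lambda v: -np.abs(v).sum())
```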
Preprint of (not) today: Jain et al., "Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models" -- diffusion-tree-sampling.github.io Not all denoising trajectories are equal. MCTS search can be used to efficiently and effectively search for the best one.
Preprint of (not) today: Ma et al., "Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction" -- jiahao-ma.github.io/puzzles/ Simple and effective idea. Crop and warp to different views (with incomplete regions) for data augmentation -- more 3D supervision.
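The core trick as I understand it: unproject a crop with its depth, perturb the camera, and reproject to mint an extra "view" (with holes) that still carries valid 3D supervision. A toy numpy sketch under those assumptions, not the authors' pipeline:

```python
import numpy as np

def warp_to_virtual_view(depth, K, T_rel):
    # depth: (H, W) metric depth; K: (3, 3) intrinsics; T_rel: (4, 4) virtual-from-source pose.
    # Returns, per source pixel, its coordinates in the virtual view.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)   # back-project to 3D
    pts = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_new = (T_rel @ pts)[:3]                               # move into the virtual camera
    proj = K @ pts_new
    return (proj[:2] / proj[2:]).T.reshape(H, W, 2)
```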
Come work with us! The Machine Learning Research (MLR) team at Apple is seeking a passionate AI researcher to work on Efficient ML algorithms: jobs.apple.com/en-us/details/…
Preprint of today: Liang et al., "Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space" -- arxiv.org/abs/2507.00392 Train an encoder for multi-view consistent encodings --> train a decoder for robust matching features. Cool idea, but some overclaims in the paper.

Preprint of (not) today: Xu et al., "4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos" -- 4dgt.github.io A feed-forward dynamic Gaussian Splatting estimator trained with monocular videos. 2-stage coarse-to-fine training + mono-depth guidance.
Preprint of today: Vavilala et al., "Generative Blocks World: Moving Things Around in Pictures" -- arxiv.org/abs/2506.20703 I have a soft spot for reviving old ideas in modern methods -- blocks world via primitives, now with diffusion models for generating/editing images.

Preprint of today: Vontobel et al., "HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling" -- arxiv.org/abs/2506.20452 This one also uses wavelets and frequency-dependent guidance. Generate->Invert->Add high-freq details for each part.
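The band-merging intuition, as I read it: keep the low-frequency (approximation) band from the upsampled base generation and take the detail bands from the fresh high-resolution pass. A toy sketch assuming pywt is available -- my simplification, not the HiWave sampler:

```python
import numpy as np
import pywt

def merge_wavelet_bands(base_upsampled, detail_pass):
    # Keep the approximation (low-frequency) band of the upsampled base generation,
    # take the detail (high-frequency) bands from the high-resolution detail pass.
    cA_base, _ = pywt.dwt2(base_upsampled, 'haar')
    _, details_new = pywt.dwt2(detail_pass, 'haar')
    return pywt.idwt2((cA_base, details_new), 'haar')
```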

Preprint of today: Sadat et al., "Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales" -- arxiv.org/abs/2506.19713 Classifier-Free Guidance affects generation differently at different frequencies -- use a low scale for the low frequencies and a high scale for the high ones!
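The takeaway lends itself to a quick sketch: split the guided prediction into low- and high-frequency bands and use a different CFG scale per band. A toy FFT version with made-up scales and cutoff, not the paper's exact guidance rule:

```python
import numpy as np

def frequency_split_cfg(eps_cond, eps_uncond, w_low=1.5, w_high=7.5, cutoff=0.1):
    # eps_cond/eps_uncond: (H, W) conditional / unconditional model predictions.
    H, W = eps_cond.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    low_mask = (np.sqrt(fx ** 2 + fy ** 2) < cutoff).astype(float)

    def guide(w):                                        # the usual CFG combination at scale w
        return eps_uncond + w * (eps_cond - eps_uncond)

    low = np.fft.fft2(guide(w_low)) * low_mask           # weak guidance for low frequencies
    high = np.fft.fft2(guide(w_high)) * (1 - low_mask)   # strong guidance for high frequencies
    return np.real(np.fft.ifft2(low + high))
```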
