Haiyang Wang
@haiyang73756134
PhD Student @pku1898, focusing on foundation models and network architecture design. Working toward AGI.
🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With…
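A minimal sketch of the coarse-to-fine idea behind NSA-style sparse attention, not the paper's fused-kernel implementation: it omits the causal masking, gating, and sliding-window branch that NSA combines, and the shapes and hyperparameters (`block_size`, `top_k`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, block_size=16, top_k=4):
    """q, k, v: (T, d). Single head, no batching, T assumed divisible by block_size."""
    T, d = k.shape
    nb = T // block_size
    # Coarse-grained token compression: mean-pool each key block into one token.
    k_blocks = k[: nb * block_size].view(nb, block_size, d).mean(dim=1)  # (nb, d)
    # Score compressed blocks per query and keep the top-k most relevant blocks.
    block_scores = q @ k_blocks.T / d**0.5                 # (T, nb)
    top_blocks = block_scores.topk(top_k, dim=-1).indices  # (T, top_k)
    # Fine-grained token selection: attend only over tokens in the chosen blocks.
    out = torch.empty_like(q)
    for i in range(T):
        idx = (top_blocks[i, :, None] * block_size
               + torch.arange(block_size)).reshape(-1)     # token ids in kept blocks
        attn = F.softmax(q[i] @ k[idx].T / d**0.5, dim=-1)
        out[i] = attn @ v[idx]
    return out

q, k, v = (torch.randn(128, 32) for _ in range(3))
print(sparse_attention_sketch(q, k, v).shape)  # torch.Size([128, 32])
```

The per-query loop is for clarity only; the point of the hardware-aligned design is precisely that block-wise selection batches into dense, GPU-friendly kernels.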
TokenFormer, a new model architecture from @cvml_mpiinf and @PKU1898, scales from 124M to 1.4B parameters by treating parameters as tokens, maintaining Transformer performance with lower cost. Talk to the team @haiyang73756134 @ferjadnaeem @xyongqin @janericlenssen @fedassa here!
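A minimal sketch of the "parameters as tokens" idea: layer weights become learnable key/value token pairs that inputs attend to, so capacity can grow by appending parameter tokens. Plain softmax stands in for the paper's modified attention normalization, and the sizes and `grow()` helper are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParamAttention(nn.Module):
    def __init__(self, dim, n_param_tokens):
        super().__init__()
        # Learnable parameter tokens play the role of weight matrices.
        self.key_params = nn.Parameter(torch.randn(n_param_tokens, dim) * dim**-0.5)
        self.val_params = nn.Parameter(torch.randn(n_param_tokens, dim) * dim**-0.5)

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = x @ self.key_params.T          # input tokens attend to parameter tokens
        return F.softmax(scores, dim=-1) @ self.val_params

    def grow(self, extra):
        # Scaling: append new parameter tokens without touching trained ones.
        # (With plain softmax this only approximately preserves the trained function.)
        self.key_params = nn.Parameter(torch.cat(
            [self.key_params.data, torch.zeros(extra, self.key_params.shape[1])]))
        self.val_params = nn.Parameter(torch.cat(
            [self.val_params.data, torch.zeros(extra, self.val_params.shape[1])]))

layer = TokenParamAttention(dim=64, n_param_tokens=128)
y = layer(torch.randn(2, 10, 64))
layer.grow(64)                                  # 128 -> 192 parameter tokens
print(y.shape, layer.key_params.shape)          # (2, 10, 64), (192, 64)
```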
Our GiT got an 𝐨𝐫𝐚𝐥 𝐩𝐫𝐞𝐬𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 at #ECCV2024. See you at #ECCV2024. Paper: GiT: Towards Generalist Vision Transformer through Universal Language Interface arxiv.org/pdf/2403.09394 Code: github.com/Haiyang-W/GiT (please star it if it's helpful ⭐)

#ICLR2024 Arrived in Vienna! Happy to share our recent work 𝗧𝗼𝘄𝗮𝗿𝗱𝘀 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗮𝗻𝗱 𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗴𝗲𝗼𝗺𝗲𝘁𝗿𝗶𝗰 𝗱𝗲𝗲𝗽 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝘀𝗰𝗶𝗲𝗻𝗰𝗲! With incredible CTL and @ask1729! May 9 10:45am-12:45pm (Poster 254, Halle B). Details⬇️ (1/n)
#ICLR2024 Just arrived in Vienna! Don't miss our oral presentation tomorrow afternoon in room Halle A3, focusing on 𝐆𝐍𝐍𝐬 and their 𝗲𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗼𝘄𝗲𝗿! Also, swing by our poster session (Poster 272, Halle B). See you there! 👋
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction"…
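A minimal sketch of the next-scale loop described above: each step predicts an entire token map at the next resolution, conditioned on the upsampled coarser maps, replacing raster-scan next-token order with next-scale order. `predict_scale` is a hypothetical stand-in for the actual transformer and VQ decoder, not the VAR codebase.

```python
import torch
import torch.nn.functional as F

def predict_scale(context, hw):
    # Stand-in: a real model would run a transformer over `context`
    # and decode an (hw x hw) map of discrete VQ token ids.
    return torch.randint(0, 4096, (1, hw, hw))

def next_scale_generation(scales=(1, 2, 4, 8, 16)):
    maps, context = [], None
    for i, hw in enumerate(scales):
        token_map = predict_scale(context, hw)   # predict all tokens of this scale at once
        maps.append(token_map)
        if i + 1 < len(scales):
            nxt = scales[i + 1]
            # Upsample the coarse map as conditioning for the next, finer scale.
            context = F.interpolate(token_map[None].float(),
                                    size=(nxt, nxt), mode="nearest")
    return maps

maps = next_scale_generation()
print([m.shape for m in maps])  # token maps from 1x1 up to 16x16
```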
To reduce human bias in architecture design, we propose a simple yet effective LLM-like visual framework, called GiT, applicable to various vision tasks (e.g., VL tasks and segmentation) with only a vanilla ViT. :) Code: github.com/Haiyang-W/GiT arxiv.org/abs/2403.09394
GiT: Towards Generalist Vision Transformer through Universal Language Interface This paper proposes a simple yet effective framework, called GiT, simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture…
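A minimal sketch of what a universal language interface means in practice: heterogeneous outputs (words, boxes) are mapped into one shared discrete token space, so a single autoregressive ViT can decode any task as a plain token sequence. Both tokenizers below are hypothetical illustrations, not the paper's exact scheme.

```python
def tokenize_caption(text, vocab):
    # Captioning output is already language: just look up word ids.
    return [vocab[w] for w in text.split()]

def tokenize_box(box, num_bins=1000, text_vocab_size=30000):
    # Detection output: quantize normalized box coordinates into discrete
    # bins appended after the text vocabulary, so boxes and words share
    # one token space and one decoding head.
    return [text_vocab_size + int(c * (num_bins - 1)) for c in box]

vocab = {"a": 0, "cat": 1}
caption_tokens = tokenize_caption("a cat", vocab)
box_tokens = tokenize_box([0.1, 0.2, 0.5, 0.6])
print(caption_tokens, box_tokens)  # both are plain token-id sequences
```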