Vision transformers with hierarchical attention
Beijing Zhongke Journal Publishing Co. Ltd.
Image: The overall architecture of HAT-Net is illustrated in Fig. 1.
Credit: Beijing Zhongke Journal Publishing Co. Ltd.
In the last decade, convolutional neural networks (CNNs) have been the go-to architecture in computer vision, owing to their powerful capability of learning representations from images and videos. Meanwhile, in the field of natural language processing (NLP), the transformer architecture has become the de facto standard for handling long-range dependencies. Transformers rely heavily on self-attention to model global relationships in sequence data. Although global modeling is also essential for vision tasks, the 2D/3D structures of vision data make it less straightforward to apply transformers there. This predicament was recently broken by Dosovitskiy et al., who applied a pure transformer directly to sequences of image patches.
Motivated by Dosovitskiy et al., a large body of literature on vision transformers has emerged to resolve the problems caused by the domain gap between computer vision and NLP. From this paper's point of view, one major problem of vision transformers is that the sequence of image patches is much longer than the sequence of tokens (words) in an NLP application, leading to high computational and space complexity when computing the multi-head self-attention (MHSA). Some efforts have been dedicated to resolving this problem.
ToMe improves the throughput of existing ViT models by systematically merging similar tokens using a general and lightweight matching algorithm. Pyramid vision transformer (PVT) and multiscale vision transformer (MViT) downsample the feature maps to compute attention over a reduced number of tokens, but at the cost of losing fine-grained details. Swin transformer computes attention within small windows to model local relationships, gradually enlarging the receptive field by shifting windows and stacking more layers. From this point of view, Swin transformer may still be suboptimal because it works in a manner similar to CNNs and needs many layers to model long-range dependencies.
Downsampling-based transformers and window-based transformers thus have complementary merits: the former excel at directly modeling global dependencies but may sacrifice fine-grained details, while the latter effectively capture local dependencies but fall short in global dependency modeling. As is widely accepted, both global and local information is essential for visual scene understanding. Motivated by this insight, the researchers seek to combine the strengths of both paradigms, enabling the direct modeling of both global and local dependencies.
To achieve this, the paper introduces hierarchical multi-head self-attention (H-MHSA), a novel mechanism that enhances the flexibility and efficiency of self-attention computation in transformers. The method begins by segmenting an image into patches, treating each patch as a token. Rather than computing attention across all patches, the researchers further organize these patches into small grids and perform attention computation within each grid. This step is instrumental in capturing local relationships and generating more discriminative local representations.
Subsequently, the researchers merge these small patches into larger ones and treat the merged patches as new tokens, substantially reducing their number. This enables the direct modeling of global dependencies by calculating self-attention over the new tokens. Finally, the attentive features from both the local and global hierarchies are aggregated to yield potent features with rich granularities. Notably, because the attention calculation at each step is confined to a small number of tokens, the proposed hierarchical strategy mitigates the computational and space complexity of vanilla transformers. Empirical observations underscore the efficacy of this hierarchical self-attention mechanism, revealing improved generalization in experiments.
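To make the two-level procedure concrete, the following PyTorch-style sketch illustrates one way such a hierarchical attention block could be organized. The module name HMHSA, the grid_size parameter, and the average-pooling/upsampling choices are illustrative assumptions based on the description above, not the authors' released implementation.

```python
# Minimal sketch of hierarchical multi-head self-attention (H-MHSA), following
# the description above. Names (HMHSA, grid_size) and the pooling/upsampling
# choices are illustrative assumptions, not the authors' actual code.
# Assumes dim is divisible by num_heads and H, W are divisible by grid_size.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HMHSA(nn.Module):
    def __init__(self, dim, num_heads=4, grid_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.grid = grid_size                   # side length of a local grid
        self.qkv = nn.Linear(dim, dim * 3)      # shared QKV projection
        self.proj = nn.Linear(dim, dim)         # output projection

    def attention(self, x):
        """Standard multi-head self-attention over a (B, N, C) token sequence."""
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v          # (B, heads, N, head_dim)
        return out.transpose(1, 2).reshape(B, N, C)

    def forward(self, x, H, W):
        """x: (B, H*W, C) patch tokens laid out on an H x W grid."""
        B, N, C = x.shape
        g = self.grid

        # Local hierarchy: attention inside each g x g grid of patches.
        local = x.reshape(B, H // g, g, W // g, g, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)
        local = self.attention(local)
        local = local.reshape(B, H // g, W // g, g, g, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Global hierarchy: merge each grid into one coarse token, then attend
        # over the much shorter sequence of coarse tokens.
        coarse = x.reshape(B, H, W, C).permute(0, 3, 1, 2)    # (B, C, H, W)
        coarse = F.avg_pool2d(coarse, kernel_size=g)          # (B, C, H/g, W/g)
        coarse = coarse.flatten(2).transpose(1, 2)            # (B, N/g^2, C)
        global_feat = self.attention(coarse)

        # Upsample the coarse attentive features back to the full token grid
        # and aggregate both hierarchies.
        global_feat = global_feat.transpose(1, 2).reshape(B, C, H // g, W // g)
        global_feat = F.interpolate(global_feat, scale_factor=g, mode="nearest")
        global_feat = global_feat.flatten(2).transpose(1, 2)  # (B, N, C)
        return self.proj(local + global_feat)


# Example (hypothetical sizes): 4x4 grids on a 28x28 token map of 64-dim tokens.
# tokens = torch.randn(2, 28 * 28, 64)
# out = HMHSA(dim=64, num_heads=4, grid_size=4)(tokens, H=28, W=28)  # (2, 784, 64)
```

In this sketch, with 7×7 local grids on a 56×56 token map, for instance, the local step attends over only 49 tokens per grid and the global step over 64 coarse tokens, rather than attending over all 3,136 tokens at once, which is what keeps the computational and space cost low.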
By simply incorporating H-MHSA, the researchers build a family of hierarchical-attention-based transformer networks (HAT-Net). To evaluate the efficacy of HAT-Net in scene understanding, the researchers apply HAT-Net to fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experimental results demonstrate that HAT-Net performs favorably against previous backbone networks. Note that H-MHSA is based on a very simple and intuitive idea, so it is expected to provide a new perspective for the future design of vision transformers.
See the article:
Liu, Y., Wu, YH., Sun, G. et al. Vision Transformers with Hierarchical Attention. Mach. Intell. Res. 21, 670–683 (2024). https://doi.org/10.1007/s11633-024-1393-8