Article Highlight | 27-Sep-2024

Rethinking global context in crowd counting

Beijing Zhongke Journal Publishing Co. Ltd.

At first sight, counting the size of a crowd in an image appears equivalent to detecting and counting person instances. Such direct approaches, however, have been shown to perform poorly, because generic detectors suffer from the small instance sizes and severe occlusions found in crowded regions: typically, a person covers only a small number of pixels, and only a few body parts are visible (often just the head).


State-of-the-art crowd counting approaches therefore rely on the prediction of crowd density maps, a localized, pixel-wise measure of person presence. To this end, the underlying network architectures need to integrate context across locations and scales. This is crucial due to the vast variety of possible appearances of a given crowd density. In other words, the ability to integrate a large context makes it possible to adapt the density estimation to an expectation raised by the given scene, beyond the tunnel vision of local estimation. Geometry and semantics are two of the main aspects of scene context that can serve this goal for crowd counting. Unfortunately, even if such knowledge can be modelled and represented, it is very cumbersome to obtain, and therefore not practical for many applications of image-based crowd counting. This also reflects the setup of the most popular crowd counting challenge datasets considered in this paper.
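As a concrete illustration of the density-map idea, here is a minimal NumPy sketch (the grid size, the Gaussian width and the head annotations are illustrative assumptions, not the paper's setup): each annotated head point contributes a normalized 2-D Gaussian, so the map integrates to the total person count.

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Build a crowd density map from annotated head points (x, y).
    Each person contributes a normalized 2-D Gaussian, so the map
    sums to the total person count."""
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()  # each person adds exactly 1 to the integral
    return dmap

dmap = density_map([(10, 12), (40, 30), (25, 25)], shape=(64, 64))
print(round(dmap.sum()))  # total count recovered from the map: 3
```

The count is thus recovered by summing the map, while its spatial layout tells the network where the people are.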


On the bright side, even in the absence of such direct knowledge, one can benefit from the recent progress in geometric and semantic learning on a conceptual level, by studying the inductive biases involved. In fact, the development of computer vision in the last decade has demonstrated that representations capturing rich geometric and semantic information can be learned implicitly from a single image. Recently, the advantage of global interaction over convolutional neural networks (CNNs) has been demonstrated both for geometric features in monocular depth prediction and for semantic features in segmentation. These works attribute the success of the transformer to its global receptive field, the lack of which has been a bottleneck in previous CNN-based approaches. Moreover, CNNs by design apply the same operation at all locations, making them a sub-optimal choice for exploiting information about the geometric and semantic composition of the scene.


As geometric and semantic understanding are crucial aspects of scene context for the task of crowd counting, this paper hypothesizes that the superior capabilities of transformers in these aspects are also indicative of a more suitable inductive bias for crowd counting. To investigate this hypothesis, the paper adapts vision transformers to the task of crowd counting.


Unlike image classification, crowd counting is a dense prediction task. Following the previous discussion, learning to count crowds is also predicated on the global context of the image. To capture both the spatial information needed for dense prediction and the necessary scene context, the paper maintains both local tokens (representing image patches) and a context token (representing image context). A token attention module (TAM) is then introduced to refine the encoded features, informed by the context token. The learning of the context token is further guided by a regression token module (RTM), which accommodates an auxiliary loss on the regression of the total crowd count. The refined transformer output is then mapped to the desired crowd density map using two convolution layers. Please refer to Fig. 1 for an illustration of the overall framework.
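The data flow just described can be sketched end to end. Everything below (the dimensions, the mean-pooled stand-in for the context token, and the random projections standing in for the transformer, conv head and RTM weights) is a hypothetical simplification kept only to show how the pieces connect, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(patch_feats, W_mix, w_head, w_count):
    """Toy forward pass mirroring the described flow: local tokens plus
    a context token -> TAM-style channel recalibration -> one density
    value per patch, with an auxiliary total-count estimate (RTM)."""
    ctx = patch_feats.mean(axis=0)                # stand-in context token
    tokens = patch_feats @ W_mix                  # stand-in for transformer mixing
    gate = 1.0 / (1.0 + np.exp(-(ctx @ W_mix)))   # channel weights from the context token
    refined = tokens * gate                       # recalibrate every local token per channel
    density = np.maximum(refined @ w_head, 0.0)   # stand-in for the two-conv-layer head
    count = float((ctx @ W_mix) @ w_count)        # auxiliary count regression
    return density, count

d = 8
patches = rng.standard_normal((16, d))            # 16 local tokens, d channels
density, count = forward(patches,
                         rng.standard_normal((d, d)) / d ** 0.5,
                         rng.standard_normal(d) / d ** 0.5,
                         rng.standard_normal(d) / d ** 0.5)
print(density.shape)  # one non-negative density value per local token: (16,)
```

In the real model the learned transformer, TAM and RTM replace these random projections, and the per-patch outputs form a full-resolution density map.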


In particular, the proposed TAM is designed to address the observation that multi-head self-attention (MHSA) in vision transformers only models spatial interactions, while tried-and-true channel-wise interactions have also proved highly effective. To this end, the TAM imprints the context token on the local tokens by conditionally recalibrating feature channels, thereby explicitly modelling channel-wise interdependencies. Widely used methods for this goal include SENet and CBAM, which apply simple aggregation techniques such as global average pooling or global maximum pooling to the input features to obtain channel-wise statistics (a global abstraction), which are then used to capture channel-wise dependencies. For transformers, this paper proposes a natural and elegant way to model channel relationships: extending the input sequence with a context token and introducing the TAM to recalibrate local tokens through channel-wise attention informed by the context token. The additional attention across feature channels further facilitates the learning of global context.
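One reading of the recalibration step, in the spirit of the SENet-style gating mentioned above (the sigmoid gate and the toy numbers are illustrative assumptions): the context token is squashed into per-channel weights in (0, 1) that scale every local token.

```python
import numpy as np

def tam(local_tokens, context_token):
    """Channel-wise recalibration informed by the context token:
    a sigmoid turns the context token into one weight per channel,
    which is broadcast over all local tokens (SE-style gating)."""
    weights = 1.0 / (1.0 + np.exp(-context_token))  # per-channel weights in (0, 1)
    return local_tokens * weights                   # broadcast over the token axis

tokens = np.ones((4, 3))             # 4 local tokens, 3 channels
ctx = np.array([0.0, 10.0, -10.0])   # neutral / boost / suppress
out = tam(tokens, ctx)
print(out[0])  # channel 0 halved, channel 1 kept, channel 2 suppressed
```

Because the gate is computed from a token that attends to the whole image, every channel of every local token is modulated by global context rather than by local statistics alone.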


The paper also makes the context token, which interacts with the patch tokens throughout the transformer, regress the total crowd count of the whole image. This is achieved by the proposed RTM, which contains a two-layer multi-layer perceptron (MLP). On the one hand, the combination of TAM and RTM forces the context token to collect image-level count estimates from all local tokens and distribute them back, leading to a better representation of the context token. On the other hand, it helps the network learn better underlying features for the task and reduces overfitting, similar to auxiliary-task learning.
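A sketch of the RTM idea under stated assumptions (the weights are hypothetical stand-ins, and the simple weighted sum of losses is an illustrative choice, not the paper's exact formulation): a two-layer MLP maps the context token to a scalar count, whose error is added as an auxiliary term to the density loss.

```python
import numpy as np

def rtm(context_token, W1, b1, W2, b2):
    """Two-layer MLP regressing the total crowd count
    from the context token (ReLU hidden layer)."""
    hidden = np.maximum(context_token @ W1 + b1, 0.0)
    return float(hidden @ W2 + b2)

def total_loss(pred_density, gt_density, pred_count, gt_count, lam=0.1):
    """Density-map loss plus the auxiliary count-regression loss."""
    density_loss = np.mean((pred_density - gt_density) ** 2)
    count_loss = (pred_count - gt_count) ** 2
    return density_loss + lam * count_loss

# Toy usage with identity-like weights (purely illustrative).
ctx = np.array([1.0, 2.0])
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([1.0, 1.0]), 0.0
count = rtm(ctx, W1, b1, W2, b2)
print(count)  # 3.0
```

During training, gradients from the count loss flow back through the context token into the whole encoder, which is what ties the two modules together.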


In summary, this paper provides another perspective on density-supervised crowd counting, through the lens of learning features with global context. Specifically, it introduces a context token tasked with refining the local feature tokens through a novel framework of token-attention and regression-token modules. This framework thereby addresses the shortcomings of CNNs with regard to capturing global context for crowd counting. Experiments are conducted on various popular datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU. The results demonstrate that the proposed context extraction techniques significantly improve performance over the baselines and thus open a new path for crowd counting.


See the article:

Sun, G., Liu, Y., Probst, T. et al. Rethinking Global Context in Crowd Counting. Mach. Intell. Res. 21, 640–651 (2024). https://doi.org/10.1007/s11633-023-1475-z

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.