Fig. 1 (IMAGE)
Caption
Comparison of gaze coordinates between human participants and attention heads of vision transformers (ViTs)
Video clips from N2010 (Nakano et al., 2010) and CW2019 (Costela and Woods, 2019) were presented to ViTs. The gaze position of each self-attention head, identified as the peak of the class token's ([CLS]) self-attention map over the patch tokens, was compared with human gaze positions from the respective dataset. Six ViT models were tested, with varying numbers of layers (L = 4, 8, or 12), trained either by supervised learning (SL) or by self-supervised learning with the DINO method.
Credit
Reproduced from Yamamoto, K., Akahoshi, H., & Kitazawa, S. (2025). Emergence of human-like attention in self-supervised Vision Transformers: An eye-tracking study. Neural Networks.
Usage Restrictions
Credit must be given to the creator.
License
CC BY