Fig. 3 (IMAGE)
Caption
Attention heads in DINO-trained ViTs were grouped into three categories
In the DINO-trained ViT12 model, 144 attention heads from layers exhibiting human-like attention were grouped by multidimensional scaling of attention-to-gaze distances computed across many images. The analysis yielded three distinct groups: G1 heads focused on the center of figures (e.g., faces), G2 heads on figure outlines (e.g., whole bodies), and G3 heads on the ground (background).
Credit
Adapted from Yamamoto K, Akahoshi H, Kitazawa S. (2025). Emergence of human-like attention in self-supervised Vision Transformers: an eye-tracking study. *Neural Networks*.
Usage Restrictions
Credit must be given to the creator.
License
CC BY