Fig. 2 (IMAGE)
Caption
Attention of DINO-trained ViTs closely resembles that of humans
Top: In a scene depicting a conversation between two children, human gaze is predominantly directed toward the face of the child on the right (left panel). A ViT trained with the DINO method likewise focuses on the face of the child on the right (center panel). In contrast, the attention of a ViT trained with supervised learning (SL) is scattered (right panel).
Bottom: The distance between each attention head's attention pattern and human gaze was quantified layer by layer. In the DINO-trained ViT, heads whose attention patterns resembled human gaze emerged in layers 9 and 10.
Credit
Adapted from Yamamoto K, Akahoshi H, Kitazawa S. (2025). Emergence of human-like attention in self-supervised Vision Transformers: an eye-tracking study. Neural Networks.
Usage Restrictions
Credit must be given to the creator.
License
CC BY