News Release

Self-trained vision transformers mimic human gaze with surprising precision

A research team led by The University of Osaka has demonstrated that vision transformers using self-attention mechanisms can spontaneously develop visual attention patterns similar to those of humans, without specific training.

Peer-Reviewed Publication

The University of Osaka

Fig. 1

Comparison of gaze coordinates between human participants and attention heads of vision transformers (ViTs)

Video clips from N2010 (Nakano et al., 2010) and CW2019 (Costela and Woods, 2019) were presented to ViTs. The gaze positions of each self-attention head in the class token ([CLS]) — identified as peak positions within the self-attention map directed at the patch tokens — were compared with human gaze positions from the respective datasets. Six ViT models with varying numbers of layers (L = 4, 8, or 12) were tested, trained either by supervised learning (SL) or by self-supervised learning using the DINO method.

Credit: Reproduced from Yamamoto K, Akahoshi H, Kitazawa S. (2025). Emergence of human-like attention in self-supervised Vision Transformers: an eye-tracking study. Neural Networks.

Osaka, Japan – Can machines ever see the world as we see it? Researchers have uncovered compelling evidence that vision transformers (ViTs), a type of deep-learning model that specializes in image analysis, can spontaneously develop human-like visual attention patterns when trained without labeled instructions.

Visual attention is the mechanism by which organisms, or artificial intelligence (AI), filter out ‘visual noise’ to focus on the most relevant parts of an image or view. While this comes naturally to humans, acquiring it spontaneously has proven difficult for AI. However, in their recent publication in Neural Networks, the researchers have shown that with the right training experience, AI can spontaneously acquire human-like visual attention without being explicitly taught to do so.

The research team, from The University of Osaka, compared human eye-tracking data to attention patterns generated by ViTs trained using DINO (‘self-distillation with no labels’), a method of self-supervised learning that allows models to organize visual information without annotated datasets. Remarkably, the DINO-trained ViTs exhibited gaze behavior that closely mirrored that of typically developing adults when viewing dynamic video clips. In contrast, ViTs trained with conventional supervised learning showed unnatural visual attention.
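As a rough illustration of the comparison described above (a minimal sketch, not the authors' actual analysis pipeline), a ViT head's "gaze position" can be read out as the peak of the [CLS]-to-patch attention map. The function below assumes an attention matrix of shape (number of heads, number of patches) and a known patch grid size:

```python
import numpy as np

def head_gaze_positions(cls_attn, grid_size):
    """Return the per-head (row, col) peak of CLS-to-patch attention.

    cls_attn: array of shape (num_heads, num_patches) holding the
        attention weights from the [CLS] token to each patch token
        in one layer.
    grid_size: (rows, cols) of the patch grid, e.g. (14, 14) for a
        224x224 image split into 16x16 patches.
    """
    flat_peaks = np.argmax(cls_attn, axis=1)            # flat patch index per head
    rows, cols = np.unravel_index(flat_peaks, grid_size)
    return np.stack([rows, cols], axis=1)               # shape (num_heads, 2)

# Toy example: 3 heads attending over a 4x4 patch grid
attn = np.zeros((3, 16))
attn[0, 5] = 1.0    # head 0 peaks at patch (1, 1)
attn[1, 0] = 1.0    # head 1 peaks at patch (0, 0)
attn[2, 15] = 1.0   # head 2 peaks at patch (3, 3)
print(head_gaze_positions(attn, (4, 4)))
```

These per-head peak coordinates could then be mapped back to image pixels and compared against human fixation points from eye-tracking data.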

“Our models didn’t just attend to visual scenes randomly; they spontaneously developed specialized functions,” says Takuto Yamamoto, lead author of the study. “One subset of the model consistently focused on faces, another captured the outlines of entire figures, and a third attended primarily to background features. This closely reflects how human visual systems segment and interpret scenes.”

Through detailed analyses, the team demonstrated that these attention clusters emerged naturally in the DINO-trained ViTs. These attention patterns were not only qualitatively similar to the human gaze, but also quantitatively aligned with established eye-tracking data, particularly in scenes involving human figures. The findings suggest a possible extension of the traditional, two-part figure–ground model of perception in psychology into a three-part model.

“What makes this result remarkable is that these models were never told what a face is,” explains senior author Shigeru Kitazawa. “Yet they learned to prioritize faces, probably because doing so maximized the information gained from their environment. It is a compelling demonstration that self-supervised learning may capture something fundamental about how intelligent systems, including humans, learn from the world.”

The study underscores the potential of self-supervised learning not only for advancing AI applications, but also for modeling aspects of biological vision. By aligning artificial systems more closely with human perception, self-supervised ViTs offer a new lens for interpreting both machine learning and human cognition. The findings could serve a variety of applications, such as the development of human-friendly robots or enhanced support during early childhood development.

###

The article “Emergence of Human-Like Attention and Distinct Head Clusters in Self-Supervised Vision Transformers: A Comparative Eye-Tracking Study” has been published in Neural Networks at DOI: https://doi.org/10.1016/j.neunet.2025.107595

About The University of Osaka

The University of Osaka was founded in 1931 as one of the seven imperial universities of Japan and is now one of Japan's leading comprehensive universities with a broad disciplinary spectrum. This strength is coupled with a singular drive for innovation that extends throughout the scientific process, from fundamental research to the creation of applied technology with positive economic impacts. Its commitment to innovation has been recognized in Japan and around the world: it was named Japan's most innovative university in 2015 (Reuters 2015 Top 100) and one of the most innovative institutions in the world in 2017 (Innovative Universities and the Nature Index Innovation 2017). The university is now leveraging its role as a Designated National University Corporation, selected by the Ministry of Education, Culture, Sports, Science and Technology, to contribute to innovation for human welfare, the sustainable development of society, and social transformation.

Website: https://resou.osaka-u.ac.jp/e


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.