News Release

Researchers develop multi-modal vision-language model for generalizable annotation-free pathology localization

Peer-Reviewed Publication

Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Accurate localization of pathologies in medical images is crucial for precise diagnosis and effective treatment. Existing deep learning models that localize pathologies from clinical imaging data rely heavily on expert annotations and generalize poorly in open clinical environments.

In a study published in Nature Biomedical Engineering, a team led by Prof. WANG Shanshan from the Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences, together with Prof. ZHANG Kang from Wenzhou Medical University, developed a generalizable vision-language model for Annotation-Free pathology Localization (AFLoc), which outperforms existing approaches in localizing pathologies and supporting diagnosis across a wide range of clinical environments.

The AFLoc model contains an image encoder and a text encoder. The former extracts shallow local, deep local, and global visual representations, while the latter captures word-level, sentence-level, and report-level semantic features from clinical reports. Through contrastive learning, the model achieves multi-level semantic alignment between images and texts, enabling image interpretation without manual annotations.
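The following is a minimal, illustrative sketch of such multi-level image-report contrastive alignment. All names, dimensions, and the loss form are assumptions for demonstration; this is not the authors' released code.

```python
# Minimal PyTorch sketch of multi-level image-report contrastive alignment.
# Encoder outputs are stand-ins; the specifics are illustrative assumptions.
import torch
import torch.nn.functional as F


def info_nce(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_level_alignment_loss(image_levels, text_levels):
    """Sum contrastive losses over corresponding representation levels.

    image_levels: [shallow_local, deep_local, global]        -- each (B, D), pooled
    text_levels:  [word_level, sentence_level, report_level] -- each (B, D), pooled
    """
    return sum(info_nce(v, t) for v, t in zip(image_levels, text_levels))


# Toy usage with random features standing in for encoder outputs.
B, D = 8, 256
image_levels = [torch.randn(B, D) for _ in range(3)]
text_levels = [torch.randn(B, D) for _ in range(3)]
print(multi_level_alignment_loss(image_levels, text_levels).item())
```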

This design allows the AFLoc model to automatically learn the inherent relationships between disease descriptions and the corresponding image regions, and to accurately highlight lesion areas when provided with a clinical report or automatically generated text prompts.
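One simple way to picture prompt-driven localization is to score each image patch against a text embedding and render the scores as a heatmap. The sketch below assumes pooled patch features and a single prompt embedding; shapes and the upsampling step are illustrative, not the model's actual inference pipeline.

```python
# Illustrative sketch: patch-prompt cosine similarity rendered as a heatmap.
import torch
import torch.nn.functional as F


def localization_heatmap(patch_feats, text_emb, out_size=(224, 224)):
    """patch_feats: (H, W, D) local visual features; text_emb: (D,) prompt embedding."""
    H, W, D = patch_feats.shape
    patches = F.normalize(patch_feats.reshape(-1, D), dim=-1)   # (H*W, D)
    prompt = F.normalize(text_emb, dim=-1)                      # (D,)
    sim = (patches @ prompt).reshape(1, 1, H, W)                # per-patch similarity
    heatmap = F.interpolate(sim, size=out_size, mode="bilinear",
                            align_corners=False)                # upsample to image size
    return heatmap.squeeze()


# Toy usage: a 14x14 grid of 256-d patch features and one prompt embedding.
print(localization_heatmap(torch.randn(14, 14, 256), torch.randn(256)).shape)
```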

The researchers found that the AFLoc model has excellent lesion localization capability. After being pre-trained on the MIMIC-CXR dataset, the model achieved significantly better localization performance than existing models and even surpassed human benchmarks in several disease categories across eight widely used public datasets covering 34 common thoracic diseases.

In addition, they found that the AFLoc model has strong disease diagnostic capability. The model consistently outperformed existing models in zero-shot classification tasks on chest X-rays, retinal fundus images, and histopathology slides. In retinal disease diagnosis, its zero-shot classification even surpassed some models fine-tuned with labeled data.

"By integrating visual and linguistic semantics, the AFLoc model learns directly from image-report pairs and demonstrates robust and generalizable performance across diverse imaging tasks," said Prof. WANG. The model has the potential to improve medical diagnostics. This study lays a foundation for developing scalable and efficient medical artificial intelligence systems.
