With the great success of large language models, self-supervised pre-training techniques have shown great promise in the field of drug discovery. In particular, multimodal pre-training models have emerged as a key technique for drug discovery. This Review summarizes the foundations of molecular modalities and comprehensively revisits the popular network frameworks, self-supervised tasks, training strategies and their applications in drug discovery. In addition, this work emphasizes the compatibility between various modalities and network frameworks or pre-training tasks. Simultaneously, this Review discusses the differences and connections among various modalities and pre-training models.
Based on previous works, this Review identifies two emerging trends: (1) Transformers and graph neural networks are often integrated as encoders and then combined with multiple pre-training tasks to learn cross-scale molecular representations, thereby improving the accuracy of drug discovery (see the sketch below). (2) Molecule captions, as brief biomedical texts, provide a bridge for collaboration between drug discovery and large language models.
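To make trend (1) concrete, the following is a minimal, illustrative PyTorch sketch of a hybrid encoder that pairs a Transformer branch over SMILES tokens with a simple message-passing GNN branch over the molecular graph, fusing both into a single cross-scale embedding. All module names, feature dimensions, and the fusion scheme are hypothetical assumptions for illustration, not any specific published model.

# Minimal sketch of a hybrid molecular encoder: a Transformer branch over
# SMILES tokens plus a simple message-passing GNN branch over the molecular
# graph, fused into one representation for self-supervised pre-training.
# All names, dimensions, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleGNNLayer(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (batch, nodes, dim); adj: (batch, nodes, nodes) with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        msg = adj @ h / deg                # mean of neighbor features
        return torch.relu(self.linear(msg))


class HybridMolecularEncoder(nn.Module):
    def __init__(self, vocab_size=64, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # Transformer branch for the 1D SMILES string
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # GNN branch for the 2D molecular graph
        self.atom_emb = nn.Linear(16, dim)  # 16 = assumed atom-feature size
        self.gnn_layers = nn.ModuleList(SimpleGNNLayer(dim) for _ in range(n_layers))
        # Fusion head producing one cross-scale molecule embedding
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, smiles_tokens, atom_feats, adj):
        t = self.transformer(self.token_emb(smiles_tokens)).mean(dim=1)
        g = self.atom_emb(atom_feats)
        for layer in self.gnn_layers:
            g = layer(g, adj)
        g = g.mean(dim=1)                   # graph-level readout
        return self.fuse(torch.cat([t, g], dim=-1))


# Usage: the fused embedding would feed pre-training heads such as masked-token
# prediction or graph-text contrastive losses.
enc = HybridMolecularEncoder()
tokens = torch.randint(0, 64, (2, 20))      # batch of 2 SMILES, length 20
atoms = torch.randn(2, 9, 16)               # 9 atoms, 16 features each
adj = torch.eye(9).expand(2, 9, 9)          # identity = self-loops only
print(enc(tokens, atoms, adj).shape)        # torch.Size([2, 128])

In practice, the fused embedding is trained jointly under multiple self-supervised objectives so that sequence-level and graph-level signals complement each other.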
Finally, this Review discusses the challenges of multimodal pre-training models in drug discovery and explores future opportunities, including unified network frameworks, discrimination of cross-modal consistency and complementarity, integration of more modalities, domain knowledge, and general-purpose foundation models.
To promote the development of multimodal pre-training models in the field of drug discovery, this Review collects the key data and materials on GitHub at https://github.com/AISciLab/MultiPM4Drug.git.