image: Protein foundation models (pFMs) use protein data as the "flame" (the core source of data and knowledge) and AI methodologies as the "crucible" (the computational model and training framework). Through a data-driven "refining" process, pFMs learn from vast quantities of sequence and structural data, enabling precise predictions of protein structures and functions and, ultimately, the design of novel proteins with tailored functionalities.
Credit: ©Science China Press
The analysis of protein structures and functions is one of the core topics in life science research. Rapidly evolving sequencing technologies have generated hundreds of millions of protein sequences. However, only about 0.23% of sequences in the UniProt database have been expert-curated, and the share of proteins with experimentally resolved three-dimensional structures is even lower, at less than 0.1%. Protein Foundation Models (pFMs) have recently emerged as an effective response to this widening "knowledge gap" between protein sequences and their structures and functions. They use protein data as the "flame" (the core source of data and knowledge) and AI methodologies as the "crucible" (the computational model and training framework). Through a data-driven "refining" process, pFMs learn from vast quantities of sequence and structural data, enabling precise predictions of protein structures and functions and, ultimately, the design of novel proteins with tailored functionalities (see Figure 1). A collaborative team led by Prof. Wenjie Shu and colleagues at the Bioinformatics Center of AMMS has published a review entitled "Protein Foundation Models: A Comprehensive Survey" online in the journal Science China Life Sciences. The review summarizes and discusses the development, applications, and research prospects of pFMs.
The review highlights that diverse datasets are a cornerstone of pFM development. These datasets cover a wide range of information: not only the amino acid sequences of proteins but also experimentally resolved and predicted three-dimensional structures, functional annotations, and protein-protein interaction networks. The breadth and depth of these data sources allow models to learn the complex characteristics of proteins and drive continuous iteration in model technologies.
The development of pFMs over time shows a clear trend: from sequence-based modeling to the integration of multimodal information, and from capturing static features to simulating dynamic processes. pFMs have now matured into several technical approaches, each built on a distinct algorithmic principle and suited to different research needs:
- Autoencoder models leverage an encoder-decoder architecture to efficiently extract and reconstruct protein features, playing a vital role in the feature preprocessing stage of protein structure prediction.
- Autoregressive models rely on probabilistic modeling for sequence generation, excelling in de novo protein sequence design and completion.
- Diffusion models utilize a stepwise denoising generation mechanism to produce novel proteins with enhanced structural stability and controllable functionality.
- Flow matching models demonstrate unique advantages in the study of dynamic processes, such as protein dynamics simulations, due to their efficient probability density estimation capabilities.
These technical approaches evolve in parallel and complement each other, collectively advancing the in-depth exploration of protein science (see Figure 2).
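To make the autoregressive approach in the list above concrete, the following is a minimal Python sketch, not the method of any model covered in the review. It generates a sequence one residue at a time, each drawn from a distribution conditioned on the residues already produced, and scores a sequence with the chain rule of probability. The function `next_residue_probs` and its toy rule are hypothetical stand-ins for a trained neural network, and the prefix `MKT` is an arbitrary illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def next_residue_probs(prefix):
    """Hypothetical stand-in for a trained model: P(next residue | prefix).

    Toy rule (an assumption for illustration): all residues are equally
    likely, except that the previous residue is twice as likely to repeat.
    """
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        weights[prefix[-1]] = 2.0
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def sample_sequence(length, rng, prefix=""):
    """Extend `prefix` one residue at a time until it reaches `length`.

    An empty prefix corresponds to de novo generation; a non-empty prefix
    corresponds to sequence completion.
    """
    seq = prefix
    while len(seq) < length:
        probs = next_residue_probs(seq)
        residues = list(probs)
        weights = [probs[r] for r in residues]
        seq += rng.choices(residues, weights=weights, k=1)[0]
    return seq


def log_likelihood(seq):
    """Chain rule: log P(seq) = sum over i of log P(seq[i] | seq[:i])."""
    return sum(
        math.log(next_residue_probs(seq[:i])[seq[i]]) for i in range(len(seq))
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    designed = sample_sequence(30, rng)
    completed = sample_sequence(30, rng, prefix="MKT")
    print(designed, log_likelihood(designed))
    print(completed, log_likelihood(completed))
```

In a real autoregressive pFM, the conditional distribution would come from a network trained on large sequence corpora. The sampling loop and the chain-rule scoring keep the same shape, which is why the same model can serve both de novo design and completion.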
pFMs are versatile and influential across application scenarios. In basic biological research, they help researchers uncover the relationship between protein structure and function and analyze the regulatory mechanisms of life. In protein discovery and engineering, they enable the efficient design of proteins with specific functions, providing new solutions for developing industrial enzymes and biomaterials. In the biomedical field, pFMs play an important role in screening disease-related protein biomarkers, identifying drug targets, and designing novel therapies, thereby opening up new avenues for the diagnosis and treatment of major diseases.
Despite significant achievements, the development of pFMs still faces several challenges. Data remain a bottleneck: high-quality, standardized protein benchmark datasets are relatively scarce. The model evaluation system is still incomplete, which hinders comprehensive and accurate assessment of model performance. Model interpretability is also limited, making it difficult to understand the biological mechanisms behind the predictions. Together, these factors restrict the further application and development of the models.
Looking forward, the research directions for pFMs are well defined. Researchers will focus on modeling protein dynamics and interactions, moving models from static description to dynamic prediction. Building integrated virtual cell systems will become another important direction: by integrating multi-dimensional biological data, pFMs are expected to enable the simulation and prediction of complex life processes in cells.
Journal
Science China Life Sciences
Method of Research
Systematic review