image: In this study, 773 untreated breast cancer patients from across China were enrolled and followed up for at least 5 years. We obtained clinical data from 773 cases, RNA sequencing data from 752 cases, and proteomic data from 271 cases. We used 5 different data combinations to develop and train the model, and then compared the performance of the different feature combination models using 5 different algorithms. By optimizing the number of features, we selected the most important subset from a large number of features. Finally, an optimal model containing 13 features was determined. To make the model transparent and trustworthy, we used advanced interpretability techniques including SHAP and KAN. In addition, we encapsulated the interpretable model in a user-friendly web tool for clinicians, enabling real-time prediction and result visualization. Finally, to further confirm the biological basis and clinical relevance of the model, we conducted immunohistochemical verification.
Credit: Zhixuan Wu, Shengnan Yao, Lingli Jin, Xue Wu, Rongrong Zhang, Ouchen Wang, Erjie Xia
Introduction
Breast cancer is a leading malignancy in women globally, characterized by complex biology and variable clinical outcomes. While diagnostic and therapeutic approaches have improved, predicting survival accurately remains difficult due to tumor heterogeneity and individual treatment responses. Conventional prognostic models based on clinicopathological features—such as tumor size, lymph node status, and hormone receptor expression—offer limited predictive power. Thus, there is a need to develop more accurate multi-omics models for predicting 5-year survival in breast cancer patients.
Advances in multi-omics technologies now allow comprehensive molecular profiling of tumors, spanning genomics, transcriptomics, and proteomics. Proteomics, in particular, reflects functional cellular states and may offer more reliable prognostic biomarkers than genomic or transcriptomic data alone. Previous proteomic studies have identified signatures linked to survival outcomes, but these have typically not been integrated with clinical variables into a unified prognostic model. To our knowledge, no study has combined proteomic profiles with standard clinical factors to predict 5-year survival in breast cancer.
Artificial intelligence (AI), especially deep learning, shows great promise in analyzing complex biomedical data. However, the "black-box" nature of many AI models limits their clinical adoption. Explainable AI (XAI) methods, such as SHAP (SHapley Additive exPlanations), address this by improving model interpretability. In addition, Kolmogorov–Arnold Networks (KANs) offer advantages over methods like LIME by directly modeling functional relationships and enabling global interpretability.
This study aims to develop an interpretable AI model that integrates proteomics and clinical data to predict 5-year survival in breast cancer. Using SHAP and KAN, we will identify key predictive factors and enhance model transparency. We also plan to create a web-based tool for real-time predictions and visualization of feature contributions, and to validate key protein expressions via immunohistochemical staining.
Results
Patients’ characteristics
A total of 773 patients with breast cancer were included in this study. The baseline characteristics of the training set and test set included age, menopausal status, histological type, histological grade, lymph node positivity, and other demographic characteristics. The median follow-up was 83.1 months.
Model development and feature optimization
To identify the optimal multi-omics features for predicting 5-year survival (yes or no) in patients with breast cancer, we trained a deep neural network (DNN) on five distinct feature combinations. The AUC values in the test set were 0.624 (clinical features), 0.716 (RNA-seq features), 0.711 (RNA-seq combined with clinical features), 0.720 (proteome), and 0.814 (proteome combined with clinical features). The proteome-plus-clinical model exhibited the best predictive performance. Additional evaluation metrics further confirmed its superior performance, with an accuracy of 0.811, a recall of 0.861, a precision of 0.919, and an F1-score of 0.889.
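Since the feature sets above are ranked by AUC, it may help to recall what that number measures: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pair counting. The function name and the toy labels and scores are hypothetical illustrations, not data or code from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = survived 5 years, 0 = did not.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)  # 5 of 6 pairs are ordered correctly
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.624 to 0.814 sit.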
To enhance computational efficiency and reduce dimensionality, we employed a three-step feature selection strategy. First, a filter-based method reduced the feature pool to 100 candidates; second, an embedded approach narrowed it to 50; and finally, a wrapper-based technique identified the 20 most informative variables. Model performance remained stable during this process, with the 20-feature model achieving the highest accuracy (0.890) and F1-score (0.849), indicating that substantial dimensionality reduction was possible without sacrificing predictive power. These 20 features were then evaluated using five machine learning algorithms, among which DNN performed best (AUC = 0.877), outperforming XGBoost (0.644), Logistic Regression (0.792), KNN (0.643), and Naive Bayes (0.585). Subsequently, feature optimization was refined using SHAP values. Performance increased with the addition of higher-ranked features but plateaued after 13. The 13-feature model achieved an AUC of 0.864—comparable to the 20-feature model (0.877)—while offering greater parsimony. Precision (0.970), recall (0.810), and F1-score (0.883) further confirmed its robust predictive capability. Therefore, the top 13 features were selected for subsequent analyses. These comprised four clinical variables (tumor size, adjuvant endocrine therapy, lymph node positivity, and HER2 status) and nine proteins (EGFR, MPHOSPH10, ACOX2, CASP3, ARL3, KRT18, FAM102A, STEAP3, and BUB1B).
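The F1-scores reported above can be cross-checked from the stated precision and recall, because F1 is their harmonic mean. The short sketch below does this for the two models for which both values are given; the function and variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (returns 0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text.
f1_proteome_clinical = f1_score(0.919, 0.861)  # full proteome + clinical model
f1_13_features = f1_score(0.970, 0.810)        # final 13-feature model

print(round(f1_proteome_clinical, 3), round(f1_13_features, 3))
```

Both results round to the reported values (0.889 and 0.883), so the stated metrics are internally consistent.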
SHAP interpretation
To explain the predictions of the final model, we used SHAP to quantify the contribution of each feature. The donut plot and beeswarm plot revealed that the features contributing to the prediction outcome were MPHOSPH10, EGFR, ARL3, KRT18, lymph node positivity, HER2 status (IHC), adjuvant endocrine therapy, FAM102A, STEAP3, CASP3, tumor size, BUB1B, and ACOX2. Proteins such as MPHOSPH10, EGFR, and ARL3, along with clinical traits such as lymph node positivity and HER2 status (IHC), had a notable effect on the model's output, as shown by the wide variability of their SHAP values across samples and their relatively scattered color distribution. In individual samples, elevated expression of MPHOSPH10 and EGFR (shown in red) could increase the model's predicted value. The correlation heatmap, in turn, illustrated the strength and direction of the relationships among features. For instance, the correlation coefficient of adjuvant endocrine therapy with itself was 1.00, as expected, while its correlation with lymph node positivity was 0.19 (weak positive), with EGFR was -0.36 (moderate negative), and with MPHOSPH10 was -0.22 (weak negative).
Furthermore, SHAP local explanations showed how individual features shifted the predicted probability of 5-year survival (yes or no) for each patient. An output of f(x) = 0 corresponded to a prediction of no 5-year survival, whereas f(x) = 1 corresponded to predicted 5-year survival.
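The local explanations described here rest on the additivity property of Shapley values: a single prediction equals a base value plus the sum of the per-feature contributions. The sketch below illustrates this with an exact, brute-force Shapley computation on a three-feature toy function. It is a simplified illustration, not the study's pipeline: the toy model, its coefficients, and the single all-zero baseline are assumptions, whereas the SHAP library typically averages over a background dataset and uses faster approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical score with one interaction term (NOT the study's model)."""
    a, b, c = features
    return 0.2 + 0.5 * a - 0.3 * b + 0.1 * a * c


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
reconstructed = base_value + sum(phi)  # equals toy_model(x) by additivity
```

In a force plot, the base value is the starting point and each feature's contribution pushes the prediction up or down; in this toy case the interaction term is split equally between the two features involved.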
To assess the robustness of the model across different molecular subtypes, stratified analyses were performed for Luminal A, Luminal B, HER2-enriched, and triple-negative breast cancer (TNBC). The model achieved consistently high predictive performance, with AUCs of 1.00 for Luminal A, 0.98 for Luminal B, 0.96 for HER2-enriched, and 0.92 for TNBC. SHAP analysis further revealed distinct patterns of feature importance within each subtype. In Luminal A, adjuvant endocrine therapy, MPHOSPH10, and EGFR were the most influential predictors, whereas lymph node positivity, ARL3, and MPHOSPH10 dominated in Luminal B. For HER2-enriched tumors, lymph node positivity, EGFR, and MPHOSPH10 ranked highest, while in TNBC, MPHOSPH10, ARL3, and EGFR showed the strongest contributions. These findings demonstrate that the model maintains excellent predictive ability across heterogeneous subtypes, while also capturing subtype-specific prognostic drivers.
KAN interpretation and optimization
To further enhance transparency and interpretability, the optimal 13-feature model integrating proteomic and clinical features was validated and interpreted using KAN. Receiver operating characteristic (ROC) curve analysis yielded an AUC of 0.81, indicating strong discriminative power for 5-year survival prediction in breast cancer. The KAN network topology was visualized in detail, showing how the 5-year survival prediction is built from the protein markers (EGFR, MPHOSPH10, ACOX2, CASP3, ARL3, KRT18, FAM102A, STEAP3, and BUB1B), tumor characteristics such as tumor size and lymph node metastasis, and treatment factors such as endocrine therapy. Furthermore, fitted function analysis identified MPHOSPH10 (R²=0.92) and tumor size (R²=0.95) as key contributors to the model's predictions, with strong linear relationships between these features and the prediction outcome. Applying KAN allowed us to quantify the functional form of each feature–outcome relationship and to identify features exhibiting strong linear dominance (e.g., MPHOSPH10, tumor size). This level of mechanistic interpretability would be difficult to achieve with local surrogate models such as LIME, which do not directly characterize the global feature–outcome mapping. These findings further support KAN's capability to decipher complex interactions within biomedical data.
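The R² values quoted for MPHOSPH10 and tumor size describe how well a simple fitted function matches the relationship the network learned for that feature. As a rough illustration of the idea, and assuming the fit in question is linear (the text describes these relationships as linear but does not give the fitting procedure), the sketch below fits a line by ordinary least squares and reports R². The function name and the sample points are hypothetical.

```python
def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical samples of a learned univariate function (NOT study data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
exactly_linear = [2.0 * x + 1.0 for x in xs]  # a line fits perfectly
curved = [x ** 2 for x in xs]                 # a line fits only approximately

slope_lin, intercept_lin, r2_lin = linear_fit_r2(xs, exactly_linear)
slope_cur, intercept_cur, r2_cur = linear_fit_r2(xs, curved)
```

An R² close to 1, as reported for tumor size (0.95), indicates that a straight line captures most of the learned relationship; lower values would point to curvature that a linear summary misses.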
Online prediction tool
To support clinical application of the final model, we developed an intuitive web application based on the Streamlit Python framework, providing a visual deployment of the 5-year breast cancer survival prediction model. Clinicians or patients can enter the values of the 13 key features in the interactive panel on the left. The 5-year survival prediction is then displayed in real time on the right as a SHAP force plot, illustrating each feature's contribution to the prediction. This online prediction tool not only supports rapid clinical decision-making but also enhances clinicians' trust in the model through SHAP interpretation. The application can be accessed via the following link (https://ai-model-jhwvgzhyqyimdbvhptcxrp.streamlit.app/).
Key target validation
Finally, we examined the expression of the nine proteins in the prognostic model using the Human Protein Atlas (HPA) database. Immunohistochemical staining analysis revealed that MPHOSPH10, EGFR, ARL3, KRT18, STEAP3, CASP3, and ACOX2 were differentially expressed in breast cancer compared with normal tissues. In addition, overall survival analysis of the key proteins was externally validated using Gene Expression Omnibus (GEO) data via Kaplan-Meier Plotter. Subsequently, RNA sequencing was performed on breast cancer tissues to determine the expression levels of pivotal targets. The analysis revealed significant upregulation of BUB1B and EGFR in breast cancer tissues compared with adjacent non-cancerous tissues, whereas ACOX2 expression was notably downregulated in cancerous tissues.
In conclusion, our interpretable multi-modal model combining proteomics and clinical data demonstrates robust performance in predicting 5-year survival in breast cancer patients. The identified protein markers revealed by SHAP and KAN, particularly MPHOSPH10, represent promising prognostic biomarkers and therapeutic targets that merit further investigation. The integration of advanced interpretability techniques and development of an accessible prediction tool enhances the potential for clinical translation, supporting the vision of precision oncology in breast cancer management.
Method of Research
Experimental study
Subject of Research
People
Article Title
An Interpretable Machine Learning Model for Predicting 5‐Year Survival in Breast Cancer Based on Integration of Proteomics and Clinical Data.
Article Publication Date
7-Oct-2025