Article Highlight | 25-Jun-2026

Associations of normalization and regularization with machine learning overfitting in cross-dataset classification of deaths using transcriptomic and clinical data

Xia & He Publishing Inc.

Background and objectives

Normalization can standardize omics data and improve machine learning (ML) performance on them. However, it is unclear whether normalization is associated with overfitting (i.e., worse cross-dataset performance than intra-dataset performance). Therefore, we aimed to examine the associations of normalization and regularization with overfitting of ML models on omics data.
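The working definition of overfitting above (cross-dataset performance worse than intra-dataset performance) can be written as a simple gap. The following is a minimal sketch only: the function names and the `tolerance` parameter are hypothetical and are not taken from the study.

```python
def overfitting_gap(intra_score: float, cross_score: float) -> float:
    """Return intra-dataset minus cross-dataset performance.

    A positive value means the model did better on held-out samples from
    its own dataset than on an independent dataset.
    """
    return intra_score - cross_score


def is_overfit(intra_score: float, cross_score: float, tolerance: float = 0.0) -> bool:
    """Flag overfitting when the gap exceeds a tolerance.

    The tolerance is an illustrative assumption, not a threshold from the study.
    """
    return overfitting_gap(intra_score, cross_score) > tolerance
```

For example, under this definition a model with an intra-dataset score of 0.80 and a cross-dataset score of 0.65 would be flagged, while one scoring 0.70 and 0.72 would not.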

Methods

Using three paired transcriptomic and clinical datasets (lung adenocarcinoma: The Cancer Genome Atlas (TCGA)/Oncology Singapore; melanoma: TCGA/Dana-Farber Cancer Institute; glioblastoma: TCGA/Clinical Proteomic Tumor Analysis Consortium), we applied ANOVA-based gene selection methods, six normalization methods, and six ML models to classify deaths among cancer patients. Balanced accuracy (BA) and area under the curve (AUC) in intra- and cross-dataset settings were compared using inferential analyses.
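For readers unfamiliar with the two metrics, the sketch below computes BA and AUC for a binary outcome in plain Python. It uses the standard definitions (BA as the mean of sensitivity and specificity, AUC as the probability that a positive case is scored above a negative case). The label coding (1 = death, 0 = survival) and the function names are assumptions made for illustration; this is not the study's code.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels.

    Assumed coding (illustrative): 1 = death, 0 = survival.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In an intra-dataset setting these functions would be applied to held-out samples from the training dataset; in a cross-dataset setting they would be applied to predictions on the paired independent dataset.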

Results

Normalization consistently improved intra-dataset performance (median BA/AUC changes: 0.035–0.214/0.115–0.279) on all data, particularly with Z_Raw, but decreased or only slightly increased cross-dataset performance (median BA/AUC changes: −0.029–0.079/0.029–0.064). The Least Absolute Shrinkage and Selection Operator (LASSO) model without normalization consistently outperformed most of the other ML models in cross-dataset testing across cancer types. ML models trained on all data and on molecular data alone showed similar best performance.

Conclusions

Normalization increases ML’s intra-dataset performance and overfitting in three paired cancer transcriptomic and clinical datasets. Regularized models such as LASSO appear to mitigate overfitting and achieve robust cross-dataset performance. Therefore, cross-dataset evaluation and regularized models are recommended to assess and reduce overfitting, while normalization should be used cautiously. Adding clinical data seems to have little impact on ML models’ performance. However, future work on other diseases and datasets is warranted.

Full text

https://www.xiahepublishing.com/2771-165X/JCTP-2025-00051

The study was recently published in the Journal of Clinical and Translational Pathology.

The Journal of Clinical and Translational Pathology (JCTP) is the official scientific journal of the Chinese American Pathologists Association (CAPA). It publishes high-quality, peer-reviewed original research, reviews, perspectives, commentaries, and letters that are pertinent to clinical and translational pathology, including but not limited to anatomic pathology and clinical pathology. Basic scientific research on the pathogenesis of diseases, as well as the application of pathology-related diagnostic techniques or methodologies, also fits the scope of JCTP.
