Machine learning outperforms traditional statistical methods in addressing missing data in electronic health records
Health Data Science
Peer-Reviewed Publication
A team of researchers from the National Institute of Health Data Science at Peking University and the Department of Clinical Epidemiology and Biostatistics at Peking University People's Hospital conducted a systematic review on methods for handling missing data in electronic health records (EHRs). Missing data pose significant challenges in medical research, potentially leading to biased results and reduced statistical power. This review, which analyzed 46 studies published between 2010 and 2024, compared traditional statistical techniques, such as Multiple Imputation by Chained Equations (MICE), with advanced machine learning approaches, including Generative Adversarial Networks (GANs) and k-Nearest Neighbors (KNN).
The findings revealed that machine learning methods, especially GAN-based and time-series imputation techniques like CATSI, often outperformed traditional statistical methods in addressing missing data across diverse datasets. However, no single method was universally optimal, highlighting the need for standardized benchmarks to evaluate the performance of these methodologies under various scenarios. The research team aims to develop such benchmarks and create protocols for reliable missing data handling, ensuring more robust and reproducible outcomes in healthcare studies.
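One way to picture the kind of benchmark the review calls for is to hide values that are actually known, impute them with each candidate method, and score the imputations against the hidden truth. The sketch below is illustrative only and is not the review's protocol: the synthetic data, the function names (`mean_impute`, `knn_impute`, `masked_rmse`, `run_benchmark`), the 20% missing rate, and `k=3` are all assumptions. Column-mean imputation stands in as a simple statistical baseline (it is not MICE), and the KNN imputer is a small pure-Python version, not a library implementation.

```python
import math
import random


def mean_impute(rows):
    """Fill each missing value (None) with the mean of its column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def knn_impute(rows, k=3):
    """Fill each missing value with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    fallback = mean_impute(rows)  # used when a row has no comparable neighbours
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                result[i][j] = sum(v for _, v in nearest) / len(nearest)
            else:
                result[i][j] = fallback[i][j]
    return result


def masked_rmse(complete, imputed, hidden):
    """Root mean squared error over the cells that were deliberately hidden."""
    errors = [(complete[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(errors) / len(errors))


def run_benchmark(n_rows=200, missing_rate=0.2, seed=0):
    """Hide cells completely at random in synthetic data and score each imputer."""
    rng = random.Random(seed)
    complete = []
    for _ in range(n_rows):
        x = rng.gauss(0, 1)
        # Three strongly correlated synthetic "measurements" (an assumption).
        complete.append([x, 2 * x + rng.gauss(0, 0.1), -x + rng.gauss(0, 0.1)])
    hidden = [(i, j) for i in range(n_rows) for j in range(3) if rng.random() < missing_rate]
    hidden_set = set(hidden)
    observed = [
        [None if (i, j) in hidden_set else complete[i][j] for j in range(3)]
        for i in range(n_rows)
    ]
    return {
        "mean": masked_rmse(complete, mean_impute(observed), hidden),
        "knn": masked_rmse(complete, knn_impute(observed), hidden),
    }


if __name__ == "__main__":
    for method, rmse in run_benchmark().items():
        print(f"{method}: RMSE on hidden cells = {rmse:.3f}")
```

Because the synthetic columns are strongly correlated and values are hidden completely at random, a neighbour-based imputer has an easy job here. Real EHR data are often missing for reasons tied to the patient's condition, which is one reason a method that wins in one scenario may not win in another.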
- Journal: Health Data Science