A Comparative Analysis of Machine Learning Imputation Techniques for MAR Missingness

Shweta Tiwaskar, Sandip Thite, Rashid Mamoon

Abstract

Electronic health records (EHR) are essential for making informed patient care decisions, but missing data can hinder decision-making. This study addresses missing data under the Missing at Random (MAR) mechanism, which is common in real-world datasets. While statistical methods are traditionally used for data imputation, machine learning (ML) approaches offer greater flexibility and can capture complex relationships within the data. The paper evaluates three prominent ML-based imputation techniques, K Nearest Neighbor Imputation (KNNI), Multivariate Imputation by Chained Equations (MICE), and MissForest, focusing on their performance in handling MAR missingness in multivariate configurations. The study simulates MAR missingness (5%-30% of the dataset) across multiple variables and imputes the missing values using these methods. The imputed datasets are evaluated against a complete subset of the original data using several performance metrics (including accuracy, F1 score, MAE, RMSE, R-squared, Pearson correlation, and BIC), with particular attention to correlations between missing and observed values. To calculate these metrics, eighteen imputed datasets are compared with one complete subset of the original dataset. Compared with KNNI and MICE, the MissForest imputation method showed lower SD, MAE, and RMSE in 83.33% of missing-rate (MR) cases and higher R-squared values in all (100%) MR cases. MissForest also performed better in 100% of MR cases on all five model-performance metrics. This suggests that MissForest is a superior imputation method for handling MAR missingness in multivariate settings.
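
The evaluation loop described in the abstract (mask values under MAR, impute, then score the imputed values against the held-back truth) can be sketched on toy data. This is a minimal illustration and not the authors' pipeline. The synthetic variables, the rank-based masking rule, the six-step rate grid between 5% and 30%, and the helper names `make_mar_mask`, `knn_impute`, and `run_demo` are all assumptions made for the example. Only a simple one-predictor nearest-neighbour imputer stands in for KNNI; MICE and MissForest are not implemented here.

```python
import math
import random


def make_mar_mask(x, rate, rng):
    """Choose which entries of a second variable to hide, given fully observed x.

    The chance of being hidden depends only on x (rows with larger x are more
    likely to be masked), which is what makes the mechanism MAR.
    """
    n = len(x)
    n_missing = round(rate * n)
    # Rank-based weights: the row with the largest x gets the largest weight.
    weights = [0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda j: x[j]), start=1):
        weights[i] = rank
    candidates = list(range(n))
    mask = [False] * n
    for _ in range(n_missing):  # weighted sampling without replacement
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates])[0]
        mask[pick] = True
        candidates.remove(pick)
    return mask


def knn_impute(x, y_obs, k=5):
    """Fill each missing y (None) with the mean y of the k rows nearest in x."""
    donors = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
    filled = []
    for xi, yi in zip(x, y_obs):
        if yi is None:
            nearest = sorted(donors, key=lambda d: abs(d[0] - xi))[:k]
            filled.append(sum(d[1] for d in nearest) / len(nearest))
        else:
            filled.append(yi)
    return filled


def mae(truth, pred):
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def run_demo(n=200, rates=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30), seed=0):
    """Return {missing rate: (MAE, RMSE)} computed on the masked entries only."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]        # fully observed variable
    y = [2 * xi + rng.gauss(0, 0.5) for xi in x]   # variable that gets masked
    results = {}
    for rate in rates:
        mask = make_mar_mask(x, rate, rng)
        y_obs = [None if m else yi for yi, m in zip(y, mask)]
        y_imp = knn_impute(x, y_obs)
        truth = [yi for yi, m in zip(y, mask) if m]
        pred = [yi for yi, m in zip(y_imp, mask) if m]
        results[rate] = (mae(truth, pred), rmse(truth, pred))
    return results


if __name__ == "__main__":
    for rate, (m, r) in run_demo().items():
        print(f"rate={rate:.2f}  MAE={m:.3f}  RMSE={r:.3f}")
```

Scoring only the masked entries keeps the metrics from being diluted by values that were never missing. Comparing several imputers, as the study does, would amount to swapping `knn_impute` for each method while reusing the same masks.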
