Predictive models are widely used and are often built on demographic, survey, and other data that contain many missing values. When these missing values are not handled appropriately during model building, the predictive model is unreliable, and any decision based on it may result in losses for a company. In addition, existing techniques for imputation of missing data do not efficiently handle the very large and distributed data sources that are now encountered in practice.
Some existing techniques for imputing missing values are: mean imputation, regression imputation, multiple imputation, an Expectation Maximization (EM) algorithm, and a technique that imputes missing values while building a predictive model.
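Mean imputation replaces missing values of a continuous variable with a mean value that is computed based on all non-missing records. Mean imputation may provide inaccurate results because it ignores any relationship between the imputed variable and the other variables and because it understates the variance of the imputed variable.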
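As a minimal sketch of mean imputation, the following Python example fills missing entries (represented here as None, which is an assumption of this illustration) with the mean of the non-missing records. The function name mean_impute and the sample data are hypothetical.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [30.0, None, 50.0, 40.0, None]   # hypothetical data
completed = mean_impute(ages)
print(completed)  # [30.0, 40.0, 50.0, 40.0, 40.0]

# Every imputed record sits exactly at the mean, so the spread of the
# completed variable is smaller than the spread of the observed records.
observed = [v for v in ages if v is not None]
print(pvariance(completed) < pvariance(observed))  # True
```

The second print illustrates one source of inaccuracy: the completed variable has a smaller variance than the observed records alone.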
Regression imputation regresses the variable that has missing values on all other variables and then uses the regression equation to impute missing values for that variable. Random errors can be added to the imputed values to overcome the problem of underestimating the variance in the imputed variable. Regression imputation may provide inaccurate results if the data does not follow the assumptions of a linear regression model, for example, when the variable to be imputed is a categorical variable or has a non-linear relationship with the other variables used to impute. Moreover, regression imputation may regress the variable with missing values on all other variables, so the imputation model is not built from basic descriptive statistics.
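The regression imputation described above can be sketched as follows, assuming a single predictor and a simple linear model fitted on the complete records only. The names fit_line and regression_impute, the noise_sd parameter, and the sample data are hypothetical; a practical implementation would regress on several variables.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_impute(xs, ys, noise_sd=0.0, seed=0):
    """Fill None entries of ys using a line fitted on the complete records.

    If noise_sd > 0, a random normal error is added to each imputed value
    so that the imputed variable's variance is not understated.
    """
    rng = random.Random(seed)
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x
            if noise_sd > 0:
                y += rng.gauss(0.0, noise_sd)
        filled.append(y)
    return filled


xs = [1.0, 2.0, 3.0, 4.0]      # hypothetical predictor, fully observed
ys = [3.0, 5.0, None, 9.0]     # hypothetical target with one missing value
print(regression_impute(xs, ys))                 # deterministic imputation
print(regression_impute(xs, ys, noise_sd=0.5))   # with added random error
```

Because the model is linear, this sketch would be a poor fit for a categorical target or a non-linear relationship, which is the limitation noted above.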
Multiple imputation builds an imputation model that regresses each variable that has missing values on other variables. The imputation model is a linear regression model for a continuous variable and a logistic regression model for a categorical variable. The imputation process produces multiple complete data sets. Then, an appropriate predictive model is built on each complete data set, and the results of the multiple predictive models are combined. The iterative nature of the imputation process may require many data passes to impute a single complete data set, and those data passes are multiplied when multiple complete data sets are created. In particular, unlike linear regression, logistic regression does not have a closed form solution and needs an iterative process in which each iteration requires one data pass, so multiple imputation uses several data passes to obtain the solution for one categorical variable within one imputation. Thus, the existence of categorical variables with missing values increases the computation cost of multiple imputation.
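The flow of multiple imputation (impute several complete data sets, build a model on each, combine the results) can be sketched as below. This is a simplified illustration under stated assumptions: one continuous variable with missing values, a single predictor, a linear imputation model with random residual noise, and a "predictive model" that is merely the mean of the completed variable. A full procedure would also draw the regression parameters at random and pool variances as well as point estimates. All function names and data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_once(xs, ys, rng):
    """Create one complete data set by regression imputation plus noise."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    residuals = [y - (intercept + slope * x) for x, y in complete]
    # Residual standard deviation with n - 2 degrees of freedom.
    dof = max(len(complete) - 2, 1)
    sd = (sum(r * r for r in residuals) / dof) ** 0.5
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x + rng.gauss(0.0, sd)
        completed.append(y)
    return completed


def multiple_imputation_mean(xs, ys, m=5, seed=0):
    """Impute m complete data sets, estimate on each, pool by averaging."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(m):
        completed = impute_once(xs, ys, rng)   # one imputation
        estimates.append(mean(completed))      # model built on that data set
    return mean(estimates), estimates          # combined result


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]            # hypothetical data
ys = [2.9, 5.2, None, 9.1, None, 12.8]
pooled, per_data_set = multiple_imputation_mean(xs, ys)
print(pooled, per_data_set)
```

Each call to impute_once reads the data again, so the work grows with the number of complete data sets m, which mirrors the cost concern raised above.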
An Expectation Maximization (EM) algorithm is an iterative technique that alternates between steps (1) and (2) until the process converges on stable estimates, where step (1) estimates the model parameters based on the current data set, and step (2) imputes the missing values based on those estimated parameters to update the data set. The filled-in data set is then used to re-estimate the parameters. Typically, the EM algorithm is used under a multivariate normal model, and missing values are imputed based on a regression model. Because the EM algorithm is iterative, it requires many data passes. If the predictive model of interest is more complicated than a multivariate normal model, then the EM algorithm becomes a system of equations whose form is specific to each application. Thus, applying the EM algorithm may require skill to obtain custom-made solutions for different applications.
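The alternation between steps (1) and (2) can be sketched with the loop below. It is a simplified, EM-style illustration rather than a full EM algorithm: it assumes one predictor and a linear model, and it imputes only conditional means without carrying the conditional variance that a complete E-step under a multivariate normal model would include. The function names, tolerance, and data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def iterative_impute(xs, ys, tol=1e-10, max_iter=1000):
    """Alternate parameter estimation and imputation until values stabilize.

    Returns the filled-in values and the number of data passes used.
    """
    missing = [i for i, y in enumerate(ys) if y is None]
    observed = [y for y in ys if y is not None]
    start = sum(observed) / len(observed)        # initial fill: observed mean
    filled = [start if y is None else y for y in ys]
    passes = 0
    for _ in range(max_iter):
        # Step (1): estimate parameters from the current filled-in data set.
        intercept, slope = fit_line(xs, filled)
        passes += 1
        # Step (2): re-impute the missing values from those parameters.
        change = 0.0
        for i in missing:
            new_value = intercept + slope * xs[i]
            change = max(change, abs(new_value - filled[i]))
            filled[i] = new_value
        if change < tol:                         # converged on stable values
            break
    return filled, passes


xs = [1.0, 2.0, 3.0, 4.0, 5.0]                   # hypothetical data
ys = [3.0, 5.0, None, 9.0, None]
filled, passes = iterative_impute(xs, ys)
print(filled, passes)
```

The returned passes count makes the cost visible: even on this small example the loop needs many passes over the data before the imputed values stop changing.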
Another technique imputes missing values while building a predictive model. A population of solutions is created using the data set with missing values, where each solution includes parameters of the model and the missing values. Each of the solutions in a population is checked for fitness. After the fitness is checked, the solutions in a population are genetically evolved to establish a successive population of solutions. The process of evolving and checking fitness is continued until a stopping criterion is reached. This technique may need many runs of populations of solutions to reach the stopping criterion.
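The evolutionary technique described above can be sketched as follows. This is an illustrative sketch only, not the specific technique referenced: it assumes a linear model with one predictor, a fitness equal to the negative sum of squared errors, and particular selection, crossover, and mutation choices. All names, parameter values, and data are hypothetical.

```python
import random


def fitness(solution, xs, ys, missing):
    """Negative sum of squared errors; higher is fitter."""
    intercept, slope, fills = solution
    fill_at = dict(zip(missing, fills))
    sse = 0.0
    for i, x in enumerate(xs):
        y = fill_at[i] if ys[i] is None else ys[i]
        sse += (y - (intercept + slope * x)) ** 2
    return -sse


def crossover(a, b, rng):
    """Take each parameter and each imputed value from one of two parents."""
    fills = [rng.choice(pair) for pair in zip(a[2], b[2])]
    return (rng.choice((a[0], b[0])), rng.choice((a[1], b[1])), fills)


def mutate(solution, rng, scale):
    """Perturb model parameters and imputed values with normal noise."""
    intercept, slope, fills = solution
    return (intercept + rng.gauss(0.0, scale),
            slope + rng.gauss(0.0, scale),
            [f + rng.gauss(0.0, scale) for f in fills])


def evolve(xs, ys, pop_size=40, generations=200, target=-1e-3, seed=0):
    """Evolve (parameters, missing values) solutions until a stop criterion."""
    rng = random.Random(seed)
    missing = [i for i, y in enumerate(ys) if y is None]

    def score(s):
        return fitness(s, xs, ys, missing)

    # Each solution holds the model parameters and the missing values.
    population = [(rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0),
                   [rng.uniform(0.0, 15.0) for _ in missing])
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)   # check fitness
        history.append(score(population[0]))
        if history[-1] >= target:                  # stopping criterion
            break
        parents = population[: pop_size // 2]
        children = [population[0]]                 # keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng, scale=0.3))
        population = children
    return max(population, key=score), history


xs = [1.0, 2.0, 3.0, 4.0, 5.0]                     # hypothetical data
ys = [3.0, 5.0, None, 9.0, None]
best, history = evolve(xs, ys)
print(best, len(history), history[0], history[-1])
```

Every generation re-scores the whole population against the data, which is why reaching the stopping criterion can take many successive populations.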