Data classification plays an important role in extracting information and knowledge from data sets (i.e., data mining), especially data sets that are high-dimensional and sparse in nature. Such high-dimensional data sets have recently been referred to as “big data.” As is known, a data set characterized as big data is so large that commonly used software tools cannot manage or process the data at all, or at least not within a suitable time frame. For instance, the high dimensionality associated with big data typically results in poor performance of existing data classifiers used to classify new data records.
Typically, a data classifier is learned through the steps of: data preprocessing; model training; and model evaluation. For better accuracy, after the model evaluation step, the data preprocessing and model training steps can be reviewed and their parameters tuned, and the entire classifier learning process can then be re-run. However, this process is not well suited for big data analytics. Even one iteration of the process can be cost-prohibitive, let alone multiple iterations. As such, there is a need for techniques to improve the performance of data classifiers used to classify high-dimensional data sets including, but not limited to, data sets characterized as big data.
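By way of illustration only, the conventional classifier learning process described above (data preprocessing, model training, model evaluation, and re-running with tuned parameters) may be sketched as follows. The sketch is a minimal example under stated assumptions and is not drawn from any particular implementation: the toy records, the nearest-centroid model, the `min_count` preprocessing parameter, and all function names are hypothetical. Each candidate parameter value triggers a full pass over the training data, which is the per-iteration cost that becomes prohibitive for high-dimensional data sets.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each record is a sparse feature map plus a class label.
TRAIN = [
    ({"a": 1.0, "b": 1.0}, "x"),
    ({"a": 1.0, "c": 1.0}, "x"),
    ({"d": 1.0, "e": 1.0}, "y"),
    ({"d": 1.0, "f": 1.0}, "y"),
]
HELD_OUT = [
    ({"a": 1.0, "f": 1.0}, "x"),
    ({"d": 1.0, "b": 1.0}, "y"),
]


def preprocess(records, min_count):
    """Data preprocessing: keep features present in at least min_count records."""
    counts = Counter(f for feats, _ in records for f in feats)
    return {f for f, c in counts.items() if c >= min_count}


def project(feats, keep):
    """Drop every feature that preprocessing did not keep."""
    return {f: v for f, v in feats.items() if f in keep}


def train(records, keep):
    """Model training: compute one centroid (mean feature vector) per class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for feats, label in records:
        sizes[label] += 1
        centroid = sums[label]  # ensures the class exists even with no kept features
        for f, v in project(feats, keep).items():
            centroid[f] += v
    return {
        label: {f: s / sizes[label] for f, s in fs.items()}
        for label, fs in sums.items()
    }


def predict(model, feats, keep):
    """Assign the class whose centroid has the largest dot product with the record."""
    x = project(feats, keep)

    def score(label):
        return sum(v * model[label].get(f, 0.0) for f, v in x.items())

    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(model), key=score)


def evaluate(model, records, keep):
    """Model evaluation: fraction of records classified correctly."""
    correct = sum(predict(model, feats, keep) == label for feats, label in records)
    return correct / len(records)


def learn_classifier(train_records, eval_records, min_counts=(1, 2)):
    """Re-run the whole process once per candidate preprocessing parameter."""
    best = None
    for min_count in min_counts:
        keep = preprocess(train_records, min_count)   # step 1
        model = train(train_records, keep)            # step 2
        accuracy = evaluate(model, eval_records, keep)  # step 3
        if best is None or accuracy > best[0]:
            best = (accuracy, min_count, model, keep)
    return best
```

In this sketch, calling `learn_classifier(TRAIN, HELD_OUT)` repeats all three steps for every value in `min_counts`, so the total cost grows with the number of tuning iterations multiplied by the cost of a single pass over the data.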