1. Field of the Invention
This invention relates generally to data mining systems, and more specifically, to a system and method of mining incomplete data sets in which feature extraction operations are difficult and time consuming.
2. Discussion of the Prior Art
In recent years, massively incomplete data sets have become ubiquitous because feature extraction operations are difficult for many unstructured data sets. In many cases, the attributes in the data sets become corrupted and difficult to handle. In other cases, the users are themselves unwilling to specify the attribute values. The result is a data set in which the values of many of the attributes are unknown. However, most known data mining algorithms are designed on the assumption that all of the attributes are specified.
There are currently several ways of mining data sets in which attributes are incompletely specified. For example, if the incompleteness occurs in a small number of rows, then one may wish to ignore these rows. As another example, if the incompleteness occurs in a small number of columns, then one may ignore those columns only. Furthermore, one may try to statistically predict the actual values of the attributes and then apply the data mining algorithms.
Each of these solutions is difficult to apply in practice if most of the attributes are incompletely specified. For example, every record may have a few entries missing, and every attribute may be missing in some of the records. In such cases, by ignoring some of the rows and/or columns, one may end up ignoring each and every entry in the data set. Furthermore, the use of statistical techniques becomes highly inaccurate when the number of missing entries in the data set is very high. In such cases, the correct values of the attributes cannot be guessed very accurately.
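The three conventional approaches and their failure mode can be illustrated with a minimal sketch. The toy data set, the use of None to mark an unspecified value, and the function names are assumptions introduced for illustration only; column-mean substitution stands in for "statistical prediction" and is just one of many possible choices.

```python
from statistics import mean

# Hypothetical toy data: None marks an unspecified attribute value.
# Every record and every attribute has at least one missing entry.
records = [
    [1.0, None, 3.0],
    [None, 2.0, 3.5],
    [1.5, 2.5, None],
]

def drop_incomplete_rows(rows):
    """Keep only records in which every attribute is specified."""
    return [r for r in rows if all(v is not None for v in r)]

def drop_incomplete_columns(rows):
    """Keep only attributes that are specified in every record."""
    if not rows:
        return []
    keep = [j for j in range(len(rows[0]))
            if all(r[j] is not None for r in rows)]
    return [[r[j] for j in keep] for r in rows]

def impute_column_means(rows):
    """Replace each missing value with the mean of the known values
    in its column (a simple stand-in for statistical prediction)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        known = [r[j] for r in rows if r[j] is not None]
        means.append(mean(known) if known else None)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

rows_left = drop_incomplete_rows(records)     # no complete record survives
cols_left = drop_incomplete_columns(records)  # no complete attribute survives
imputed = impute_column_means(records)        # every guess rests on 2 values
```

With this data, row deletion and column deletion each discard every entry, and each imputed value is an average of only two observations, which is the situation the preceding paragraph describes.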
One of the characteristics of real data sets is that there are often considerable correlations among the different attributes. Consequently, there is considerable redundancy among the different attributes. These correlations create what is referred to as a “concept” structure in the data.
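As a rough illustration of such redundancy, the correlation between two attributes can be estimated from only those records in which both attributes happen to be specified. The sketch below is not taken from the invention; the function name, the None convention for missing values, and the toy data are assumptions made for the example.

```python
from math import sqrt

def pairwise_correlation(rows, a, b):
    """Pearson correlation between attributes a and b, computed only
    over records in which both values are specified.
    Returns None when fewer than two such records exist or when
    either attribute is constant over them."""
    pairs = [(r[a], r[b]) for r in rows
             if r[a] is not None and r[b] is not None]
    if len(pairs) < 2:
        return None
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

# Hypothetical data: attribute 1 is always twice attribute 0,
# but neither attribute is specified in every record.
data = [
    [1.0, 2.0, None],
    [2.0, 4.0, 5.0],
    [3.0, None, 1.0],
    [4.0, 8.0, None],
    [None, 10.0, 2.0],
]

r01 = pairwise_correlation(data, 0, 1)  # uses the three records with both values
```

Here the two attributes carry essentially the same information, so a record that specifies only one of them still reveals much about the underlying "concept".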
It would be highly desirable to provide a technique that enables the mining of incompletely specified (unstructured) data sets even though the individual attributes are not completely reconstructed.
It would be highly desirable to provide a technique that enables the mining of incompletely specified (unstructured) data sets that utilizes the correlation structure of the data in order to effectively guess the values of the concepts even though the attributes are not fully specified.