1. Field of the Invention
This invention relates generally to data classification and, more particularly, to methods and apparatuses for classifying uncertain data.
2. Description of Background
Many data collection methodologies rely upon incomplete, inaccurate, or uncertain information. For example, information collected from surveys is typically incomplete, so the missing information must be either imputed or ignored altogether. Moreover, many data sets have attribute values that are based upon approximate measurements or that are imputed from other attributes. In privacy-preserving data mining applications, perturbations are deliberately added to the data in order to mask sensitive values. In other cases, a set of base data intended for use with a data mining process may itself represent an estimation or extrapolation from one or more underlying phenomena.
It is often possible to obtain a quantitative measure of the errors and uncertainties in a set of data. For example, a quantitative estimation of noise for each of a plurality of data fields may be available. Moreover, many existing scientific methods for data collection have error estimation methodologies built into the data collection and feature extraction process. When data inaccuracy arises out of the limitations of data collection equipment, the statistical error of data collection can be estimated by prior experimentation. Because the error is estimated separately for each feature, different features of an observation may be collected to different levels of approximation. In situations where a data set is generated by a statistical method such as forecasting, errors in the data set can be estimated from the statistical methodology used to construct the data. In the case of missing data, imputation procedures may be employed to estimate one or more missing data values, and the statistical error of imputation for a given data entry is often known a priori.
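By way of illustration only, estimating the statistical error of data collection by prior experimentation might be sketched as follows. The sketch assumes that the same reference sample is measured repeatedly and that the sample standard deviation of the readings serves as the error estimate for each data field. The function name, the field names, and the readings are hypothetical and do not appear in this description.

```python
import statistics


def estimate_attribute_errors(calibration_readings):
    """Return the sample standard deviation of each field's repeated readings.

    Assumption: every list holds repeated measurements of one fixed
    reference sample, so the spread reflects measurement error only.
    """
    return {
        name: statistics.stdev(values)
        for name, values in calibration_readings.items()
    }


# Hypothetical repeated readings of a single reference sample.
calibration_readings = {
    "temperature": [20.1, 19.9, 20.0, 20.2, 19.8],
    "humidity": [48.0, 55.0, 51.0, 45.0, 51.0],
}

attribute_errors = estimate_attribute_errors(calibration_readings)
```

In this hypothetical data the humidity field has a much larger estimated error than the temperature field, which is the kind of per-field difference in approximation discussed above.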
While data collection methodologies have become increasingly sophisticated in recent years, the foregoing error estimation procedures are inadequate in the context of many data mining applications, and the usefulness of existing data mining procedures is compromised by uncertain data. Attributes that have high levels of error are accorded the same weight as attributes that have low levels of error, yielding data mining results that are inaccurate or misleading. Thus, if an underlying set of data is not of high quality, one cannot expect algorithms performed on such data to yield useful results. Accordingly, what is needed is an improved technique for classifying uncertain data.
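By way of illustration only, the effect of according equal weight to attributes having different levels of error might be sketched with a nearest-neighbor classifier. The sketch is not the technique of the invention. It assumes a Euclidean distance and, for comparison, a distance in which each attribute difference is divided by that attribute's estimated error. All names and values are hypothetical.

```python
import math


def nearest_label(query, training, errors=None):
    """Return the label of the training point closest to the query.

    If `errors` is given, each attribute difference is divided by that
    attribute's estimated error, so noisy attributes count for less.
    If `errors` is None, all attributes are weighted equally.
    """
    if errors is None:
        errors = [1.0] * len(query)

    def distance(point):
        return math.sqrt(
            sum(((q - p) / e) ** 2 for q, p, e in zip(query, point, errors))
        )

    _, best_label = min(training, key=lambda item: distance(item[0]))
    return best_label


# Hypothetical data: attribute 0 is precise, attribute 1 is very noisy.
training = [((1.0, 0.0), "a"), ((3.0, 4.0), "b")]
errors = (0.1, 5.0)
query = (1.2, 4.0)

unweighted = nearest_label(query, training)
error_scaled = nearest_label(query, training, errors)
```

With equal weights the query is assigned label "b" because it matches that point on the noisy attribute. When the differences are scaled by the estimated errors, the precise attribute dominates and the query is assigned label "a".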