In recent years, the problem of mining uncertain data sets has gained importance because of its numerous applications to a wide variety of problems. See, e.g., C. C. Aggarwal, “On Density Based Transforms for Uncertain Data Mining,” in ICDE Conference Proceedings, 2007; R. Cheng, Y. Xia, S. Prabhakar, R. Shah and J. Vitter, “Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data,” in VLDB Conference Proceedings, 2004; N. Dalvi and D. Suciu, “Efficient Query Evaluation on Probabilistic Databases,” in VLDB Conference Proceedings, 2004; A. Das Sarma, O. Benjelloun, A. Halevy and J. Widom, “Working Models for Uncertain Data,” in ICDE Conference Proceedings, 2006; D. Burdick, P. Deshpande, T. Jayram, R. Ramakrishnan and S. Vaithyanathan, “OLAP Over Uncertain and Imprecise Data,” in VLDB Conference Proceedings, 2005; and H-P. Kriegel and M. Pfeifle, “Density-Based Clustering of Uncertain Data,” in ACM KDD Conference Proceedings, 2005, the disclosures of which are incorporated by reference herein.
The problem has gained importance because data collection methodologies are often imprecise and rely on incomplete or inaccurate information. For example, sensor data sets are usually noisy, which leads to a number of challenges in processing, cleaning and mining the data. Some techniques for adaptive cleaning of such data streams may be found in S. R. Jeffery, M. Garofalakis and M. J. Franklin, “Adaptive Cleaning for RFID Data Streams,” in VLDB Conference Proceedings, 2006, the disclosure of which is incorporated by reference herein. In many cases, estimates of the uncertainty in the data are available from the methodology used to measure or reconstruct the data. Such estimates may be specified either in the form of error variances (above-cited C. C. Aggarwal, “On Density Based Transforms for Uncertain Data Mining,” in ICDE Conference Proceedings, 2007) or in the form of probability density functions (above-cited R. Cheng, Y. Xia, S. Prabhakar, R. Shah and J. Vitter, “Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data,” in VLDB Conference Proceedings, 2004).
In general, a variety of online methods exist to estimate missing data values along with the corresponding errors. This results in streams of uncertain data. Some examples in which the uncertainty of the data is available are as follows:

    In cases such as sensor streams, the errors arise out of inaccuracies in the underlying data collection equipment. In many cases, the values may be missing and statistical methods (see, e.g., R. Little and D. Rubin, “Statistical Analysis with Missing Data Values,” Wiley Series in Prob. and Stats. 1987, the disclosure of which is incorporated by reference herein) may need to be used to impute these values. In such cases, the error of imputation of the entries may be known a priori.

    In many temporal and stream applications, quick statistical forecasts of the data may be generated in order to perform the mining. C. C. Aggarwal, “On Futuristic Query Processing in Data Streams,” in EDBT Conference Proceedings, 2006, the disclosure of which is incorporated by reference herein, discusses a technique to construct forecasted pseudo-data streams for mining and querying purposes. In such cases, the statistical uncertainty in the forecasts is available.

    In privacy-preserving data mining, uncertainty may be added to the data in order to preserve the privacy of the results. For example, in some perturbation based methods (see, e.g., R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” in ACM SIGMOD Conference Proceedings, 2000, the disclosure of which is incorporated by reference herein), the data points are perturbed with the use of a probability distribution. In such cases, the exact level of uncertainty in the data is available.
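
By way of illustration only, the following minimal Python sketch shows how an uncertain record might carry an error estimate alongside its value in two of the scenarios above (imputation of missing sensor readings and privacy-preserving perturbation). The names UncertainValue, impute_missing and perturb are hypothetical, mean imputation and Gaussian noise are simple stand-ins chosen for brevity, and none of this is taken from the cited references.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class UncertainValue:
    value: float      # reported (measured, imputed, or perturbed) value
    error_std: float  # standard deviation of the error attached to the value


def impute_missing(readings):
    """Replace None entries with the mean of the observed readings.

    Assumption for this sketch: the sample standard deviation of the
    observed readings serves as a crude error estimate for each imputed
    entry, and observed entries are treated as exact (error 0.0).
    """
    observed = [r for r in readings if r is not None]
    mean = statistics.mean(observed)
    spread = statistics.stdev(observed)  # requires at least two observed values
    return [
        UncertainValue(r, 0.0) if r is not None else UncertainValue(mean, spread)
        for r in readings
    ]


def perturb(values, noise_std, rng):
    """Add zero-mean Gaussian noise whose standard deviation is known.

    Because the noise distribution is chosen by the data publisher, the
    level of uncertainty attached to each perturbed value is known exactly.
    """
    return [UncertainValue(v + rng.gauss(0.0, noise_std), noise_std) for v in values]


if __name__ == "__main__":
    print(impute_missing([1.0, None, 3.0]))
    print(perturb([0.0, 1.0], noise_std=0.5, rng=random.Random(0)))
```

In both cases the output is a stream of (value, error) pairs, which is the form of input that an uncertain data mining method would consume.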