In recent years, many advanced technologies have been developed to store and record large quantities of data continuously. In many cases, the data may contain errors or may be only partially complete. For example, sensor networks typically generate large volumes of uncertain data. In other cases, the data points may correspond to objects which are only vaguely specified, and are therefore considered uncertain in their representation. Similarly, surveys and imputation techniques create data which is uncertain in nature. This has created a need for uncertain data management algorithms and applications.
In uncertain data management, data records are represented by probability distributions rather than deterministic values. A data record is therefore represented by the parameters of a corresponding multi-dimensional probability distribution. Some examples in which uncertain data management techniques are relevant are as follows:

The uncertainty may be a result of the limitations of the underlying equipment. For example, the output of sensor networks is often uncertain. This is because of noise in the sensor inputs or errors in wireless transmission.

In many cases, such as demographic data sets, only partially aggregated data sets are available. Thus, each aggregated record is actually a probability distribution.

In privacy-preserving data mining applications, the data is perturbed in order to protect the privacy of sensitive attribute values. In some cases, probability density functions of the records may be available.

In some cases, data attributes are constructed using statistical methods such as forecasting or imputation. In such cases, the underlying uncertainty in the derived data can be estimated accurately from the underlying methodology.
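By way of illustration only, the representation of a data record by the parameters of a probability distribution can be sketched as follows. The sketch assumes, purely for the sake of example, that each dimension of a record is an independent random variable described by a mean and a standard deviation. The names UncertainRecord and expected_squared_distance are hypothetical and do not appear in the original text. Under the independence assumption, the expected squared Euclidean distance between two records is the sum over the dimensions of the squared difference of the means plus the two variances.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UncertainRecord:
    """Hypothetical uncertain record: one (mean, std) pair per dimension.

    Assumption for this sketch: dimensions are independent of each other.
    """
    means: Sequence[float]
    stds: Sequence[float]


def expected_squared_distance(a: UncertainRecord, b: UncertainRecord) -> float:
    """Expected squared Euclidean distance between two independent records.

    For independent X and Y in one dimension:
        E[(X - Y)^2] = (mu_x - mu_y)^2 + var_x + var_y
    """
    return sum(
        (ma - mb) ** 2 + sa ** 2 + sb ** 2
        for ma, sa, mb, sb in zip(a.means, a.stds, b.means, b.stds)
    )


# A noisy two-dimensional sensor reading and an exactly known reference point.
reading = UncertainRecord(means=(20.0, 0.5), stds=(0.5, 0.1))
reference = UncertainRecord(means=(21.0, 0.5), stds=(0.0, 0.0))

print(expected_squared_distance(reading, reference))
```

In this sketch the uncertainty of a record adds its variance to the expected squared distance, so a deterministic record is simply the special case in which every standard deviation is zero.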
The problems of distance function computation and indexing are closely related, since the construction of the index can be sensitive to the distance function. Furthermore, effective distance function computation is inherently more difficult in the high dimensional or uncertain case. Direct extensions of distance functions such as the Lq-metric are not very well suited to the case of high dimensional or uncertain data management. This is because these distances are most affected by the dimensions which are most dissimilar. In the high dimensional case, the statistical behavior of the sum of these dissimilar dimensions leads to the sparsity problem. This results in similar distances between every pair of points, and the distance functions are often qualitatively ineffective (see, e.g., A. Hinneburg, C. Aggarwal and D. Keim, “What is the nearest neighbor in high dimensional spaces?” VLDB Conference, (2000), the disclosure of which is incorporated by reference herein). Furthermore, the dimensions which contribute most to the distance between a pair of records are also likely to have the greatest uncertainty. Therefore, the effects of high dimensionality are magnified by the uncertainty, and the contrast in distance function computations is lost. The challenge is to design a distance function which continues to be both qualitatively effective and index-friendly.
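The loss of contrast described above for the Lq-metric can be observed with a small experiment. The following is a minimal sketch under stated assumptions: the points are drawn uniformly at random from the unit cube, the data is deterministic, and the relative contrast is measured as the gap between the farthest and nearest neighbor of a query point divided by the distance to the nearest neighbor. The function names, sample size and seed are illustrative choices and are not taken from the cited reference.

```python
import random


def lq_distance(x, y, q=2.0):
    """Lq (Minkowski) distance between two points of equal dimensionality."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def relative_contrast(dim, n_points=500, q=2.0, seed=0):
    """Return (farthest - nearest) / nearest for a random query point.

    Assumption: query and data points are uniform in the unit cube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        lq_distance(query, [rng.random() for _ in range(dim)], q)
        for _ in range(n_points)
    ]
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


# Compare the contrast at low, moderate and high dimensionality.
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
```

As the dimensionality grows, the printed relative contrast shrinks, which means the nearest and farthest neighbors of the query become nearly equidistant and a ranking based on the distance function carries less information.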
The problem of indexing has been studied in the literature both for the case of deterministic data (see, e.g., N. Beckmann, H-P. Kriegel, R. Schneider and B. Seeger, “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles,” ACM SIGMOD Conference, (1994); and S. Berchtold, D. Keim and H-P. Kriegel, “The X-Tree: An Index Structure for High Dimensional Data,” VLDB Conference, (1996), the disclosures of which are incorporated by reference herein), and for the case of uncertain data (see, e.g., R. Cheng, Y. Xia, S. Prabhakar, R. Shah and J. Vitter, “Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data,” VLDB Conference, (2004); R. Cheng, D. Kalashnikov and S. Prabhakar, “Evaluating Probabilistic Queries over Imprecise Data,” ACM SIGMOD Conference, (2003); S. Singh, C. Mayfield, S. Prabhakar, R. Shah and S. Hambrusch, “Indexing Uncertain Categorical Data,” IEEE ICDE Conference, (2007); and Y. Tao, R. Cheng, X. Xiao, W. Ngai, B. Kao and S. Prabhakar, “Indexing Multi-dimensional Uncertain Data with Arbitrary Probability Density Functions,” VLDB Conference, (2005), the disclosures of which are incorporated by reference herein).