Broadly speaking, the invention relates to the field of computer science. Specifically, it is concerned with the design of efficient database indexing structures to speed up the access of high dimensional data points from a large repository of points stored in a computer. The points to be accessed are those that are nearest to the query point.
Database management systems (DBMSs) are widely accepted as a standard tool for manipulating large volumes of data on secondary storage. To retrieve the stored data quickly, databases employ structures known as indexes. With an index, the volume of data to be fetched and processed in response to a query can be significantly reduced. In practice, large database files must be indexed to meet performance requirements.
In recent years, database systems have been increasingly used to support new applications, such as CAD/CAM systems, spatial information systems and multimedia information systems. These applications are far more complex than traditional business applications. In particular, data objects are typically represented as high-dimensional points, and queries require identifying the points that best match the query points (e.g., nearest neighbors, similarity queries), rather than exact matches. Traditional single dimensional indexing techniques, such as the B+-tree and its variants, cannot adequately support these applications. As such, new indexing mechanisms must be developed.
Many indexing methods for multi-dimensional data have been developed in the art. Early works include hierarchical tree structures (such as R-trees), linear quad-trees and grid-files. Tree-based indexing methods perform well for a small number of dimensions (and hence a large fan-out of the tree nodes). However, as the number of dimensions increases, the fan-out of the tree nodes decreases. The small fan-out leads to increased overlap between node entries as well as a taller tree. The consequence is that more paths have to be traversed and more data have to be fetched, resulting in a rapid deterioration in performance. Linear quad-trees and grid-files also work well for low dimensionalities, but their response time grows exponentially for high dimensionalities. It turns out that for high dimensionality, the simple strategy of examining all data objects remains the best strategy.
More recent efforts address this problem by reducing the dimensionality of the indexing attribute. One direction is to reduce the dimensionality of the data by projecting high-dimensional points on the hyperplane containing the axis. One algorithm (by Friedman, et al., An algorithm for finding nearest neighbors, IEEE Transactions on Computers, Vol C-24, pp. 1000-1006) truncates the high dimensional data. Another algorithm (by B. C. Ooi, et al., Indexing the Edges - A Simple and Yet Efficient Approach to Indexing High-Dimensional Indexing, Symposium on Principles of Database Systems, 2000, pp. 166-174) transforms the high dimensional data into a single dimensional value based on the maximum or minimum value of the dimensions. This work, however, is designed to support window queries, and cannot be easily extended to support nearest neighbor queries (as the concept of distance/similarity is not built in). The effectiveness of techniques in this category can be reduced because searching on the projections produces false drops. Another direction is to group high dimensional data into smaller partitions so that the search can be performed by sequentially scanning a smaller number of buckets. This approach is not expected to scale to a large number of high-dimensional data points, as the number of partitions will be too large. Moreover, it may miss some answers (e.g., Goldstein, et al., Contrast plots and p-sphere trees: space vs. time in nearest neighbor searches, 26th International Conference on Very Large Databases, 2000, pp. 429-440). Yet another direction is to specifically design indexes that facilitate metric-based query processing. However, most of the current work has been done on high-dimensional indexing structures (which suffer from poor performance as the number of dimensions becomes large).
Therefore, it is a problem in this art to reduce the dimensionality of a high-dimensional database such that no answers are missed and the number of false drops is kept to a minimum when answering a query.
The invention is a transformation-based method for indexing high-dimensional data for nearest neighbor queries. The method maps high-dimensional points into single dimensional space using a three-step algorithm. First, the data in the high dimensional space is partitioned. Second, for each partition, a point is identified to be a reference point. Third, the distance between each point in the partition and the reference point is computed. The distance, together with the partition, essentially represents the high-dimensional point in the single dimensional space. Nearest neighbor queries in the high dimensional space have to be transformed into a sequence of range queries on the single dimensional space.
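The three steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the partitioning rule (assign each point to its closest reference point), the way the partition and the distance are combined into one number (partition number times a constant c, plus the distance), and all function and variable names are assumptions made purely for illustration. The constant c is assumed to be larger than any distance that occurs within a partition, so that the key ranges of different partitions do not overlap.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_reference(point, references):
    """Steps 1 and 2 (assumed rule): a point belongs to the partition
    whose reference point is closest to it."""
    return min(range(len(references)),
               key=lambda i: euclidean(point, references[i]))

def to_key(point, references, c):
    """Step 3: represent the point by one number built from its partition
    and its distance to that partition's reference point.
    The combination i * c + distance is an assumption of this sketch."""
    i = nearest_reference(point, references)
    return i * c + euclidean(point, references[i])

# Hypothetical two-dimensional data with two reference points.
references = [(0.0, 0.0), (1.0, 1.0)]
c = 10.0  # assumed to exceed every point-to-reference distance
points = [(0.0, 0.3), (1.0, 0.5), (0.6, 0.0)]
keys = [to_key(p, references, c) for p in points]
```

In this sketch the first and third points fall into partition 0 and the second into partition 1, so their keys land in the disjoint intervals [0, c) and [c, 2c). Any ordinary single dimensional index can then be built over the keys.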
The invention has several advantages over existing techniques. First, the mapping function used is simple and computationally inexpensive. Second, because distance is a single dimensional value, we can exploit a single dimensional indexing structure to facilitate speedy retrieval. This means that the technique can be easily deployed in commercial database management systems that already provide support for single dimensional indexing. Third, the invention can produce approximate nearest neighbors quickly, and the answers are continuously refined until the nearest neighbors are obtained. We note that most of the existing approaches cannot produce any answers until all the nearest neighbors have been determined. Fourth, the invention is space efficient.
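The second and third advantages can be made concrete with a small Python sketch in which a sorted list searched with the standard bisect module stands in for a single dimensional index such as a B+-tree. The key layout (partition number times a constant c, plus the distance to the reference point), the closest-reference partitioning rule, the fixed radius increment step, and every name below are assumptions of this sketch, not details taken from the text above. Each round issues one range query per partition; by the triangle inequality, any point within radius r of the query has a distance to its reference point that differs from the query's distance to that reference point by at most r, so no such point is missed.

```python
import bisect
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, references, c):
    """Sorted (key, point) list; a stand-in for a single dimensional index."""
    entries = []
    for p in points:
        i = min(range(len(references)), key=lambda j: dist(p, references[j]))
        entries.append((i * c + dist(p, references[i]), p))
    entries.sort()
    return entries

def knn_rounds(query, k, entries, references, c, step=0.2):
    """Yield (radius, best k so far) after each round of range queries.
    Early rounds are approximate; the last round yielded is exact."""
    keys = [key for key, _ in entries]
    seen = {}  # position in entries -> distance to the query
    r = step
    while True:
        for i, ref in enumerate(references):
            d = dist(query, ref)
            # Range query on partition i, clipped to that partition's keys.
            lo = max(i * c, i * c + d - r)
            hi = min((i + 1) * c, i * c + d + r)
            start = bisect.bisect_left(keys, lo)
            end = bisect.bisect_right(keys, hi)
            for pos in range(start, end):
                if pos not in seen:
                    seen[pos] = dist(query, entries[pos][1])
        best = sorted((dq, entries[pos][1]) for pos, dq in seen.items())[:k]
        yield r, best
        # Exact once the k-th candidate lies inside the searched radius.
        exact = len(best) == k and best[-1][0] <= r
        if exact or len(seen) == len(entries):
            return
        r += step

# Hypothetical data set and query.
references = [(0.0, 0.0), (1.0, 1.0)]
c = 10.0  # assumed to exceed every point-to-reference distance
points = [(0.1, 0.1), (0.2, 0.4), (0.9, 0.8), (0.5, 0.6), (1.0, 0.2)]
query = (0.3, 0.3)
entries = build_index(points, references, c)
rounds = list(knn_rounds(query, 2, entries, references, c))
final = [p for _, p in rounds[-1][1]]
```

Writing the search as a generator means a caller can consume the approximate answers of an early round immediately and stop whenever they are good enough, while iterating to the end gives the same result as an exhaustive scan.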
FIG. 1 illustrates the flow of information and control in iDistance.
FIG. 2 gives an algorithmic description of the basic KNN search algorithm for distance-based query processing.
FIG. 3 illustrates the effects of enlarging search regions for locating KNNs.
FIG. 4 shows the search regions for NN queries q1 and q2.
FIG. 5 shows the KNN search algorithm on iDistance.
FIG. 6 illustrates the space partitioning with (centroids of (d-1)-Hyperplane, closest distance) combination.
FIGS. 7A-7B illustrate the space partitioning by (centroid, furthest distance) combination, and the query space reduction, respectively.
FIGS. 8A-8B illustrate the space partitioning by (external point, closest distance) combination, and the query space reduction, respectively.
FIG. 9 illustrates the cluster-based space partitioning with cluster centroid as reference point.
FIG. 10 illustrates the cluster-based space partitioning with edge as reference point.
FIG. 11 shows the effect of search radius on retrieval accuracy (dimension=8).
FIG. 12 shows the effect of search radius on retrieval accuracy (dimension=16).
FIG. 13 shows the effect of search radius on retrieval accuracy (dimension=30).
FIG. 14 shows the effect of search radius on retrieval efficiency.
FIG. 15 shows the effect of reference points.
FIG. 16 shows the percentage trend with varying search radius.
FIG. 17 shows the effect of the number of partitions on iDistance.
FIG. 18 shows the effect of data size on search radius.
FIG. 19 shows the effect of data size on I/O cost.
FIG. 20 shows the effect of reference points in clustered data sets.
FIG. 21 shows the effect of clustered data size.
FIG. 22 shows the CPU time performance of iDistance.
FIG. 23 shows a comparative study on a uniform data set.
FIG. 24 shows a comparative study on a clustered data set.