1. Field of the Invention
The present invention relates to the field of computing. More particularly, the present invention relates to a method for processing queries in a database management system.
2. Description of the Related Art
Imagine a database $DB$ consisting of points from $S = D_1 \times \cdots \times D_n$, where $D_i \subset \mathbb{R}$. For present purposes, the following discussion will be restricted to the case of $S = D^n$, where $D$ is the set of rational numbers in $[0,1)$ having denominators that are powers of 2, although the results can be extended to a general case. Each $D_i$ usually consists of either integers or floating-point numbers. Each point in the database can be represented by an n-tuple $(x_1, \ldots, x_n)$ of real numbers, where $n$ can be on the order of, for example, 100. A k nearest-neighbors query consists of a query point $q = (q_1, \ldots, q_n) \in D^n$ and an integer $k$ representing the number of database points that are to be returned as being near to the query point. The query point may not necessarily be in the database. Nearness can be measured with respect to a Euclidean metric or another $\ell_p$ metric. An (exact) output set $O$ consists of $k$ points from the database such that $$\forall p' \in O \ \text{and}\ \forall p'' \in DB \setminus O: \quad \lVert p' - q \rVert \le \lVert p'' - q \rVert.$$
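The definition of the exact output set can be illustrated with a minimal sketch of a linear scan. The function name `knn_linear_scan`, the parameter `p` selecting the metric, and the sample points are assumptions made for illustration only; they do not appear in the original description.

```python
import heapq

def knn_linear_scan(db, q, k, p=2.0):
    """Return the k points of db nearest to q under the l_p metric.

    Hypothetical helper: it computes one distance per database point,
    which is the cost the background discussion is concerned with.
    """
    def dist(x):
        return sum(abs(a - b) ** p for a, b in zip(x, q)) ** (1.0 / p)
    # nsmallest returns the k smallest items, ordered by increasing distance.
    return heapq.nsmallest(k, db, key=dist)

# Sample points whose coordinates have power-of-2 denominators in [0, 1).
db = [(0.0, 0.0), (0.5, 0.5), (0.25, 0.75), (0.875, 0.125)]
q = (0.5, 0.625)
print(knn_linear_scan(db, q, k=2))  # [(0.5, 0.5), (0.25, 0.75)]
```

Every point outside the returned set is at least as far from the query point as every point inside it, which is the condition stated above.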
If the database is large and a quick response is required, a good approximate output is usually sought. The approximate output can be a set of points that overlaps the exact output set $O$ to a large extent, or a set of points having distances to the query point that are not much larger than the distances of the exact output set to the query point. Image features, for example, are sometimes mapped into $D^n$ within an image database management system (DBMS). Image similarity is determined based on a distance between features in $D^n$. A similarity metric for the image DBMS can be an approximation of a desired degree of similarity, so adding a small approximation error associated with an approximate output would not significantly affect the results.
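The two notions of a good approximate output mentioned above can be made concrete with a short sketch. The names `overlap_fraction` and `distance_ratio`, and the particular way of aggregating distances (a ratio of sums), are assumptions chosen for illustration; the description does not prescribe a specific measure.

```python
import math

def overlap_fraction(approx, exact):
    """Fraction of the exact output set that also appears in the approximate set."""
    return len(set(approx) & set(exact)) / len(exact)

def distance_ratio(approx, exact, q):
    """Sum of distances of the approximate set over that of the exact set.

    A value close to 1 means the approximate points are not much farther
    from q than the exact ones. Assumes the exact sum is nonzero.
    """
    total = lambda pts: sum(math.dist(x, q) for x in pts)
    return total(approx) / total(exact)

q = (0.5, 0.625)
exact = [(0.5, 0.5), (0.25, 0.75)]
approx = [(0.5, 0.5), (0.875, 0.125)]
print(overlap_fraction(approx, exact))   # 0.5
print(distance_ratio(approx, exact, q))  # greater than 1
```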
A problem associated with a k nearest-neighbors query is how a DBMS application processes such a query so that a suitable approximate output is returned within a desired response time. The quality of an approximate output depends, of course, on the application, and the response time depends on the processing needs of the DBMS, such as disk I/O and CPU time. All currently known methods for generating an output to a k nearest-neighbors query require calculation of distances between the query point and many database points. The computational effort is dominated by the number of distance calculations. Data points are fetched from random locations in the database. When the database has a high dimensionality, the database indexes that are used cannot significantly restrict the number of points that must be fetched. In many cases, a complete linear scan of the database outperforms the currently known methods for generating an appropriate output.
Various approaches have been tried for determining near neighbors in a database, such as by using bounding boxes or spheres for indexing multidimensional data, by using projections for inducing ordering on database points, and by clustering data points. Most approaches are, nevertheless, limited by the dimensionality of the database. Results based on databases having two or three dimensions can be quite misleading when extrapolated to databases having higher dimensions.
For a bounding box or a bounding sphere approach, many hierarchical structures, such as the R-tree family, the hB-tree family and the TV-tree, have been proposed for indexing multidimensional data in database management systems. These structures collect data points into disk pages and compute bounds on the points in each disk page. A bound is usually a minimal bounding box that is parallel to the axes of the system; however, bounding spheres (sphere trees) and convex polyhedra (cell-trees) have also been tried. Nevertheless, the only structures that are used in practice are those of the R-tree family. The collection of disk pages containing the points is stored at the bottom level of the hierarchical structure. The next level up is created by taking a collection of bounds, e.g., the bounding boxes, and treating the collection as data points. The data points are then collected into groups, each fitting on a disk page, and a new bound is computed for each group. The new bound is of the same type as the bounds of a lower level. For example, if boxes are used, then the upper levels use boxes as well. Levels are created until the top level fits on one disk page.
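The level-by-level construction described above can be sketched as follows. This is a simplified illustration, not any particular member of the R-tree family: the names `bounding_box` and `build_levels`, the `page_size` parameter, and the naive grouping of entries in sorted order are all assumptions. Practical structures use more careful grouping heuristics.

```python
def bounding_box(points):
    """Minimal axis-parallel bounding box of the points, as (lows, highs)."""
    lows = tuple(min(c) for c in zip(*points))
    highs = tuple(max(c) for c in zip(*points))
    return lows, highs

def build_levels(points, page_size):
    """Return the levels of the hierarchy, bottom level first.

    Each level is a list of (box, children) entries. Grouping is done
    naively by sorted order, purely for illustration.
    """
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    pts = sorted(points)
    # Bottom level: collect points into pages and bound each page.
    level = [(bounding_box(pts[i:i + page_size]), pts[i:i + page_size])
             for i in range(0, len(pts), page_size)]
    levels = [level]
    # Upper levels: treat the boxes below as data, group them, and re-bound.
    while len(level) > page_size:
        upper = []
        for i in range(0, len(level), page_size):
            group = level[i:i + page_size]
            corners = [corner for box, _ in group for corner in box]
            upper.append((bounding_box(corners), group))
        levels.append(upper)
        level = upper
    return levels

points = [(i / 8, (i * 3 % 8) / 8) for i in range(8)]
levels = build_levels(points, page_size=2)
print(len(levels), [len(level) for level in levels])  # 2 [4, 2]
print(levels[-1][0][0])  # ((0.0, 0.0), (0.375, 0.75))
```

The loop stops as soon as the current top level has no more entries than fit on one page, mirroring the stopping rule stated in the text.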
A range query to a structure amounts to specifying a region. Points in the region are determined by going down the structure. If the region overlaps the bounds of a page, then the subtree under that page may contain points that are in the region. A nearest-neighbors query is processed as a query about a small spherical region. If the result does not contain enough points, then the region is enlarged by, for example, doubling the radius of the sphere.
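A minimal sketch of processing a nearest-neighbors query as a growing spherical range query is given below. For brevity it uses a single level of pages rather than a full hierarchy; the function names, the initial radius `r0`, and the sample pages are assumptions for illustration.

```python
import math

def bounding_box(points):
    """Minimal axis-parallel bounding box of the points, as (lows, highs)."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def box_distance(box, q):
    """Smallest Euclidean distance from q to any point of the box."""
    lows, highs = box
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for lo, hi, x in zip(lows, highs, q)))

def range_query(pages, q, r):
    """Points within distance r of q; pages whose bound misses the sphere are skipped."""
    hits = []
    for box, pts in pages:
        if box_distance(box, q) <= r:  # the sphere overlaps this page's bound
            hits.extend(p for p in pts if math.dist(p, q) <= r)
    return hits

def knn_by_growing_sphere(pages, q, k, r0=1 / 16):
    """Double the radius until the sphere holds at least k points."""
    k = min(k, sum(len(pts) for _, pts in pages))
    r = r0
    while True:
        hits = range_query(pages, q, r)
        if len(hits) >= k:
            return sorted(hits, key=lambda p: math.dist(p, q))[:k]
        r *= 2

page_a = [(0.0, 0.0), (0.25, 0.75)]
page_b = [(0.5, 0.5), (0.875, 0.125)]
pages = [(bounding_box(page_a), page_a), (bounding_box(page_b), page_b)]
print(knn_by_growing_sphere(pages, (0.5, 0.625), k=2))  # [(0.5, 0.5), (0.25, 0.75)]
```

Once a sphere contains at least k points, the k nearest neighbors all lie inside it, so the sorted prefix is an exact answer. The cost lies in how many pages each enlarged sphere overlaps.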
Conventional bounding approaches work well as long as the indexing structure behaves well. That is, the complexity of searching for points in a region should be proportional to the volume of that region, assuming the region is convex and "nice". This assumption may be justified only in databases of very low dimension, that is, up to 4 dimensions. Careful construction of a hierarchical structure may relax this constraint slightly, permitting searches in 6 or 7 dimensions. For higher dimensions, though, conventional bounding approaches perform far worse than a sequential scan of the entire data set.
A one-dimensional projection approach induces an ordering on a set of database points so that a projection of a query point within the ordering can be quickly located. Projections are continuous, so close points have close projections. On the other hand, distant points may also have close projections. The higher the dimensionality of the database, the more severe the problem of distant points having close projections becomes. That is, for a high dimensional query, when candidates for nearest neighbors are selected from among the points having projections that are close to the projection of the query, the number of candidates becomes large. Good nearest neighbor candidates should have many close projections. Nevertheless, the problem of determining good candidates is closely related to the nearest neighbors problem itself. An interesting theoretical result is reported by D. P. Huttenlocher and J. M. Kleinberg, "Comparing point sets under projection," in Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (1994), pp. 1-7. The difference between orderings based on one-dimensional projections and orderings based on space filling curves is that, in the latter, proximity in the ordering implies proximity in the space, so if there are sufficiently many orderings, it suffices to consider candidates that are close in at least one of the orderings.
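The weakness of a single one-dimensional projection can be seen in a small sketch. The function names, the `window` parameter controlling how many neighbors in the ordering are examined, and the choice of projecting onto the first coordinate axis are assumptions for illustration.

```python
import bisect
import math

def project(x, direction):
    """One-dimensional projection of x onto the given direction."""
    return sum(a * b for a, b in zip(x, direction))

def build_projection_index(db, direction):
    """Database points ordered by their projection."""
    return sorted((project(x, direction), x) for x in db)

def approx_knn(index, q, direction, k, window):
    """Pick candidates near q's position in the ordering, then rank by true distance."""
    keys = [key for key, _ in index]
    pos = bisect.bisect_left(keys, project(q, direction))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    candidates = [x for _, x in index[lo:hi]]
    return sorted(candidates, key=lambda x: math.dist(x, q))[:k]

db = [(0.5, 0.0), (0.5, 0.875), (0.125, 0.5), (0.75, 0.5)]
direction = (1.0, 0.0)  # projection onto the first coordinate axis
index = build_projection_index(db, direction)
q = (0.5, 0.5)

print(approx_knn(index, q, direction, k=1, window=1))  # [(0.125, 0.5)]
print(approx_knn(index, q, direction, k=1, window=3))  # [(0.75, 0.5)]
```

Here the points (0.5, 0.0) and (0.5, 0.875) project exactly onto the query's projection although neither is its nearest neighbor, so a narrow window misses the true nearest neighbor (0.75, 0.5). Widening the window recovers it only by examining every point in this small example.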
Clustering the database points into clusters reflecting proximity is believed to help in the search for near neighbors. Each cluster is represented by either a database point in the cluster or by the centroid of the cluster. The cluster having a representative point that is closest to the query point is searched first. Other clusters are searched in order of proximity of their representative points to the query point or based on bounds that are derived in various ways. The search is expected to end without checking all the database points. As with other conventional approaches, clustering approaches may break down in a high dimensional database because it may not be possible to identify sufficiently many clusters as being distant.
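A cluster-based search of the kind described above might look like the following sketch. The clusters are assumed to be given in advance, each is represented by its centroid, and the search stops after a fixed number of clusters (`n_probe`, a hypothetical parameter) rather than using derived bounds.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a nonempty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def cluster_search(clusters, q, k, n_probe):
    """Search the n_probe clusters whose centroids are closest to q."""
    order = sorted(range(len(clusters)),
                   key=lambda i: math.dist(centroid(clusters[i]), q))
    candidates = []
    for i in order[:n_probe]:
        candidates.extend(clusters[i])
    # Rank only the fetched candidates by their true distance to q.
    return sorted(candidates, key=lambda x: math.dist(x, q))[:k]

clusters = [
    [(0.125, 0.125), (0.25, 0.125)],
    [(0.75, 0.75), (0.875, 0.875)],
]
q = (0.75, 0.625)
print(cluster_search(clusters, q, k=1, n_probe=1))  # [(0.75, 0.75)]
```

The saving comes from never fetching the points of clusters that are not probed. If few clusters can be identified as distant from the query, `n_probe` must approach the number of clusters and the saving disappears.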
What is needed is a technique for providing a fast response to a k nearest-neighbors query.