In computer database applications, a similarity join finds data objects in a database that satisfy given similarity requirements. Similarity joins arise in query applications over multimedia databases, medical databases, scientific databases, and time-series databases. In such applications, a typical user query requires finding all pairs of similar images, retrieving music scores similar to a target music score, determining products with similar selling patterns, or discovering all stocks with similar price movements in a target database. In these and many emerging database applications, the data is represented as points in a space of high dimensionality, and the efficient processing of similarity join queries is essential.
Representing key attributes of the data objects as points, or spatial data, in a multi-dimensional space is necessary to facilitate searching the database for similar data objects. With such a mapping between data objects and multi-dimensional points, the problem of finding similar objects in the database is reduced to finding points in the multi-dimensional space that are close, or similar, to a given point. This operation is referred to as a spatial similarity join. Two points are said to be close to each other if they are within a certain distance of each other, according to some metric used to measure the distance. This distance is referred to as a similarity distance and is measured over the data attributes that are common to the two points. A closely related problem is to find all pairs of similar data objects, which translates into finding all pairs of points that satisfy the distance requirement. Prior art algorithms for mapping data objects into points in a multi-dimensional space are described, for instance, in "Fastmap: A Fast Algorithm for Indexing, Data Mining and Visualization of Traditional and Multimedia Datasets," by C. Faloutsos and K.-I. Lin, Proc. of ACM SIGMOD, pp. 163-174, 1995.
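The all-pairs form of the problem described above can be illustrated with a brute-force sketch. It assumes Euclidean distance as the metric and uses a hypothetical function name, `similarity_join`, and threshold parameter, `eps`; none of these are prescribed by the text, and the sketch uses no index structure at all.

```python
import math
from itertools import combinations

def similarity_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within eps of each other.

    Assumption: Euclidean distance is the similarity metric.
    """
    pairs = []
    # Compare every unordered pair once: O(N^2) distance computations.
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if math.dist(p, q) <= eps:
            pairs.append((i, j))
    return pairs

# Illustrative data only: two tight clusters in a 2-dimensional space.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.05)]
close_pairs = similarity_join(points, eps=0.2)
```

Because every pair is compared, the cost grows quadratically with the number of points, which is why practical methods organize the points in an index first.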
In addition to mapping data objects into multi-dimensional points, similarity join methods typically use a data structure or index for organizing the points so that they can be efficiently accessed during the join operation. Current spatial access methods have mainly concentrated on storing map information, which usually lies in a 2-dimensional or 3-dimensional space. The data structures commonly used in existing spatial access methods include the R-tree family, K-D-B tree, hB-tree, TV-tree, and Grid-file. These data structures are described, for instance, in "R-trees: A Dynamic Index Structure for Spatial Searching," by A. Guttman, Proc. ACM SIGMOD, pp. 47-57, 1984, and "The Design and Analysis of Spatial Data Structures," by H. Samet, Addison-Wesley, 1989.
While existing spatial similarity join methods work well for low-dimensional data points, they are inefficient in both execution time and the system storage required to perform the join operation when the number of dimensions of the space is large. A space of high dimensionality is usually needed for representing data with complex attributes, such as images and financial models. The poor performance and large storage requirement of existing spatial join methods arise because the data structures in these methods were designed mainly for points in a low-dimensional space.
For instance, consider a typical prior art join method based on the R-tree or the K-D-B tree data structure. The R-tree is a balanced tree, i.e., the path from its root to each leaf node has the same length. Each node of the tree represents a rectangular region in the space, and each internal node of the tree stores a minimum bounding rectangle (MBR) for each of its child nodes. The K-D-B tree is similar to the R+-tree of the R-tree family, except that its bounding rectangles cover the entire space. In forming these trees, the leaf nodes are split equally in every dimension as the nodes are traversed. When the number of dimensions is high, this leads to a very large number of leaf nodes that are within a specified distance of a given leaf node. In an n-dimensional space, there will be O(2^n) leaf nodes within the specified distance of every leaf node.
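The exponential growth in nearby leaf nodes can be checked with a small counting sketch. It assumes a simplified model that is not taken from any particular tree implementation: the unit hypercube is cut into `splits` equal intervals along every dimension, each resulting cell stands in for one leaf node, and the function name `cells_within` is hypothetical.

```python
import math
from itertools import product

def cells_within(n, eps, splits=2):
    """Count grid cells whose region lies within eps of the corner cell (0, ..., 0).

    Assumption: the unit hypercube is split into `splits` equal intervals per
    dimension, and each cell models one leaf node.
    """
    width = 1.0 / splits
    ref = (0,) * n
    count = 0
    for cell in product(range(splits), repeat=n):
        # Gap between the two cells' intervals in each dimension (0 if they touch).
        gaps = [max(0, abs(c - r) - 1) * width for c, r in zip(cell, ref)]
        if math.sqrt(sum(g * g for g in gaps)) <= eps:
            count += 1
    return count

# Even for a very small eps, the count (including the cell itself) doubles
# with each added dimension.
counts = {n: cells_within(n, eps=0.01) for n in (2, 4, 8, 10)}
```

Under this model every cell that touches the reference cell, even at a single corner, is within any positive distance of it, so the count is 2^n regardless of how small the distance is.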
The large number of leaf nodes results in poor performance and poor utilization of system storage, since there are more nodes to traverse. A prior art join algorithm typically traverses each leaf node, extends the MBR of the leaf node by the similarity distance, and finds all leaf nodes whose MBRs intersect the extended MBR. The algorithm then performs a nested-loop or sort-merge join on the points in the leaf nodes with intersecting MBRs. Thus, because of the large number of leaf nodes, the number of joins performed, which is proportional to the number of examined leaf nodes, is undesirably large.
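The leaf-level procedure just described can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: it assumes leaf nodes are given as plain lists of points rather than as nodes of a real tree, assumes Euclidean distance, uses the nested-loop variant of the join, and the names `mbr`, `intersects`, and `leaf_join` are hypothetical.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a leaf, as (min point, max point)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def intersects(a, b):
    """True if two rectangles overlap in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def leaf_join(leaves, eps):
    """Join points across leaves whose MBRs intersect after extension by eps."""
    boxes = [mbr(leaf) for leaf in leaves]
    result = set()
    for i, (lo, hi) in enumerate(boxes):
        # Extend the leaf's MBR by the similarity distance in every dimension.
        extended = (tuple(x - eps for x in lo), tuple(x + eps for x in hi))
        for j in range(i, len(leaves)):
            if not intersects(extended, boxes[j]):
                continue  # No point in leaf j can be within eps of leaf i.
            # Nested-loop join over the two candidate leaves.
            for p in leaves[i]:
                for q in leaves[j]:
                    if p != q and math.dist(p, q) <= eps:
                        result.add(tuple(sorted((p, q))))
    return result

# Illustrative data only: three leaves in a 2-dimensional space.
leaves = [
    [(0.0, 0.0), (0.2, 0.0)],
    [(0.4, 0.0), (1.0, 1.0)],
    [(3.0, 3.0), (3.1, 3.0)],
]
joined = leaf_join(leaves, eps=0.25)
```

The MBR test only prunes leaf pairs; every surviving pair still incurs a full point-by-point join, so the work grows with the number of leaves whose extended MBRs intersect.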
In addition, in performing the joins, system storage is needed to store data relating to the bounding regions associated with the tree nodes. For instance, for a K-D-B tree, the bounding rectangles are represented by the "min" and "max" points of the rectangles, which are typically maintained in storage during the similarity joins. The required system storage increases linearly with the number of dimensions of the space, and becomes undesirably large as the number of dimensions increases.
Furthermore, since the bounding regions corresponding to the child nodes of a node must be checked to determine whether to traverse the subtree starting from the examined node, the execution time for the method increases proportionally to the number of dimensions of the data points. Like the required system storage, this execution time also becomes undesirably large as the number of dimensions increases.
There is thus a need for an efficient method for performing spatial similarity joins on high-dimensional points that is based on an efficient data structure, has a short execution time, and does not require a large amount of storage space while the similarity joins are performed.