In computer database applications, proximity join operations involve finding data objects in a database that satisfy certain similarity requirements. Examples include query applications on multimedia databases, medical databases, scientific databases, and time-series databases. A typical user query in these applications may require finding all pairs of similar images, retrieving music scores similar to a target music score, determining products with similar selling patterns, or discovering all stocks with similar price movements. Typically, the data objects (with their attributes) are represented as points in a multi-dimensional space to facilitate the search of the database for similar data objects. With such a mapping between data objects and multi-dimensional points, the problem of finding similar objects in the database is reduced to finding points in the multi-dimensional space that are close, or similar, to a given point. This operation is referred to as a spatial proximity (or similarity) join. Two points are said to be in proximity of each other if they are within a certain distance of each other, according to some metric used to measure the distance. This distance is called a similarity distance and reflects the data attributes common to the two points.
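As a minimal illustration of the definition above, the following sketch computes a proximity self-join by comparing every pair of points. It assumes the Euclidean metric and a hypothetical similarity distance `eps`; neither choice is prescribed by the text, and the function name is illustrative only. The brute-force comparison is the baseline that the index-based and non-index based algorithms discussed below aim to improve on.

```python
from itertools import combinations
import math

def proximity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Brute force: every pair is compared, so the cost grows quadratically
    with the number of points.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Four points in a three-dimensional attribute space (made-up values).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (5.0, 5.0, 5.0), (5.0, 5.1, 5.0)]
pairs = proximity_self_join(points, eps=0.2)
```

Here `pairs` contains the two close pairs, (0, 1) and (2, 3); no point of the first cluster is joined with a point of the second.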
In many emerging data-mining applications, such as those finding similar time-series, it is critical to process proximity join queries efficiently so that the result is obtained quickly and with minimum data storage requirements. Prior art algorithms for multi-dimensional proximity joins may be classified as non-index based or index based. The non-index based algorithms typically use space-filling curves to map objects into one-dimensional values. This is done by partitioning the space regularly into multiple cells. A space-filling curve is drawn through the multi-dimensional space, with the cells numbered in the order they are visited. Objects to be joined are then examined sequentially, and for each cell that an object overlaps, a <cell-number, object-pointer> pair is created. Standard relational indices and techniques for computing joins can then be used on the pairs' one-dimensional cell values. Further details on non-index based algorithms may be found, for example, in "A Class of Data Structures For Associative Searching," J. A. Orenstein et al., Proc. of the ACM Symposium on Principles of Database Systems, 1984. A shortcoming of space-filling curves is that some proximity information is always lost, so nearby objects may have very different cell values. This in turn requires a complex join algorithm.
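The cell-numbering step can be sketched as follows. The example assumes a Z-order (Morton) curve in two dimensions, non-negative coordinates, and point objects that each fall in exactly one cell; the text does not name a particular curve, and `z_order`, `cell_pairs`, and `cell_size` are hypothetical names used only for illustration.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of integer cell coordinates x and y (Morton code)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code

def cell_pairs(points, cell_size):
    """Build sorted (cell-number, object-pointer) pairs for 2-D points.

    The object pointer is simply the position of the point in the input list.
    """
    pairs = []
    for pointer, (px, py) in enumerate(points):
        cx, cy = int(px // cell_size), int(py // cell_size)
        pairs.append((z_order(cx, cy), pointer))
    return sorted(pairs)

# Points 0 and 1 are only 0.2 apart but fall in cells numbered 5 and 16.
points = [(3.9, 0.5), (4.1, 0.5), (0.5, 0.5)]
pairs = cell_pairs(points, cell_size=1.0)
```

The sorted pairs can be handed to an ordinary one-dimensional join, but the gap between cell numbers 5 and 16 for two neighboring points shows the loss of proximity information noted above.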
Most of the recent work in multi-dimensional joins has focused on using indices to aid the join operation. This includes the R-tree used in "Efficient Processing of Spatial Joins Using R-trees," by T. Brinkhoff et al., Proc. of the ACM SIGMOD Conference on Management of Data, May 1994, and the seeded trees described in "Spatial Joins Using Seeded Trees," by Ming-Ling Lo et al., Proc. of the ACM SIGMOD Conference on Management of Data, May 1994. Whatever the index used, these methods all follow the same scheme, whereby two sets of multi-dimensional objects are joined by a synchronized depth-first traversal of their indices. Intersection joins are handled by joining any two index buckets that overlap. Likewise, proximity joins are handled by joining any two index buckets whose boundaries are sufficiently near.
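The test applied to a pair of index buckets during such a traversal can be sketched as below. The sketch assumes each bucket is bounded by an axis-aligned rectangle given as a `(low, high)` pair of coordinate tuples, and that "sufficiently near" means a minimum Euclidean distance of at most `eps`; these representations and names are assumptions made for illustration and are not taken from the cited papers.

```python
import math

def mbr_min_distance(a, b):
    """Minimum Euclidean distance between two axis-aligned rectangles.

    Each rectangle is a (low, high) pair of equal-length coordinate tuples.
    The per-dimension gap is zero when the extents overlap in that dimension.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    gaps = [
        max(bl - ah, al - bh, 0.0)
        for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)
    ]
    return math.sqrt(sum(g * g for g in gaps))

def buckets_may_join(a, b, eps):
    """True if some point of a could lie within eps of some point of b."""
    return mbr_min_distance(a, b) <= eps

left = ((0.0, 0.0), (1.0, 1.0))
right = ((1.5, 0.0), (2.0, 1.0))      # separated from `left` by a gap of 0.5
overlap = ((0.5, 0.5), (3.0, 3.0))    # intersects `left`
```

With these rectangles, `buckets_may_join(left, right, 0.5)` holds while `buckets_may_join(left, right, 0.4)` does not. The overlapping pair has distance zero, so the intersection join is the special case `eps = 0`.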
Most of these approaches are not well suited to the particular problem of proximity joins on high-dimensional points because they cannot scale to a large number of dimensions. For example, the R-tree and seeded tree both use a "minimum bounding rectangle" (MBR) to represent the region covered by each node in the index. As the number of dimensions gets large, the storage and traversal costs associated with using MBRs increase. Another drawback of these methods is their lack of skew-handling capabilities. Skewed data can cause rapid growth in the size of the index structures and increase their cost. Some of these problems are addressed by the ε-K-D-B tree described in the co-pending U.S. patent application Ser. No. 08/629,688 for "Method and System For Performing Spatial Similarity Joins On High-Dimensional Points," by Agrawal et al. Although the ε-K-D-B tree incurs little overhead and provides a very fast index structure for the join operations, the method described there is primarily oriented to a single-processor environment and operates serially. It fails to take advantage of the parallelism of a multiprocessor environment in building the index structure and performing the joins.
Virtually all of the existing work on parallelizing multi-dimensional joins has focused on joining two-dimensional geometric objects. For example, in "Parallel Processing Of Spatial Joins Using R-trees," T. Brinkhoff et al. use R-trees to join spatial objects in a hybrid shared-nothing/shared-memory architecture where a single data processor services all I/O requests. In "Algorithms For Data-Parallel Spatial Operations," E. G. Hoel et al. compare data-parallel quadtrees with data-parallel R- and R+-trees for joins and range queries on two-dimensional line segments. However, neither of these approaches deals with a pure shared-nothing multiprocessor architecture or with data spaces of more than two dimensions. Another approach to the parallel join problem is to regularly divide the data space into N or more partitions (where N is the number of processors in the system) and assign the partitions to different processors. See, for instance, "Partition-Based Spatial-Merge Join," by J. M. Patel et al., Proc. of the ACM SIGMOD Conference On Management of Data, June 1996. Here, after the space is partitioned, data is redistributed accordingly and each processor executes its joins independently. A disadvantage of this approach is that workload partitioning is performed before the data distribution is known, which may lead to a significant workload imbalance.
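The imbalance that regular partitioning can cause may be seen in a small sketch. It assumes, purely for illustration, that the space is cut into N equal-width stripes along a single dimension and that each stripe is assigned to one processor; the function name and the sample values are hypothetical and do not come from the cited paper.

```python
from collections import Counter

def regular_partition_loads(values, num_processors, lo, hi):
    """Count points per processor when [lo, hi) is cut into equal-width stripes.

    The stripe boundaries depend only on lo, hi, and num_processors,
    never on where the data actually lies.
    """
    width = (hi - lo) / num_processors
    loads = Counter(
        min(int((v - lo) / width), num_processors - 1)  # clamp v == hi
        for v in values
    )
    return [loads[p] for p in range(num_processors)]

# Skewed data: 90 points clustered near 0, plus three scattered points.
skewed = [i / 100 for i in range(90)] + [3.0, 6.0, 9.0]
loads = regular_partition_loads(skewed, num_processors=4, lo=0.0, hi=10.0)
```

For this input, `loads` is `[90, 1, 1, 1]`: one processor receives nearly all of the points while the other three sit almost idle, which is the workload imbalance described above.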
Thus, there is still a need for a method for performing spatial proximity joins on high-dimensional points in parallel in a multiprocessor system that takes advantage of the system's parallelism to efficiently build the index structure and perform the joins, with a minimum amount of storage space.