Multidimensional similarity join finds pairs of multi-dimensional points that are within some predetermined and typically small distance of each other. The "dimensions" may be any quantifiable property or characteristic and need not be limited to spatial dimensions as the term is routinely used. While some traditional applications, such as 2-D or 3-D mapping applications, require only two or three dimensions, many important emerging applications require the number of dimensions to be quite large, possibly in the tens or hundreds, or even thousands. Application domains include multimedia databases [See Relevant Literature description section following for references 11, 16, 17], medical databases [5, 21], scientific databases [22], and time-series databases [9, 1, 14]. Such databases may be constructed from collection and monitoring of physical systems such as medical instrumentation, including, for example, thermometers, blood pressure monitors and sensors, brain wave sensors and sampling systems, blood chemistry, diagnostic histories, and all other manner of medical, chemical, biological, physiological or other data. Data may also be collected from remote sensing apparatus, including photographic imagery data collected from hand-held, orbital, or other sensors, radars, and the like, as well as cultural and other Geographical Information System (GIS) type parameters, data, and other information. Other physical systems may likewise be monitored, and the collected data signals may be categorized, stored in databases, and used, for example, for controlling other processes or as a decision metric when compared against historical databases. These characteristics form one or more values of a multi-valued multi-dimensional data point.
Typical examples of similarity join applications include: finding all pairs of U.S. mutual funds with similar price histories; discovering images that are similar to each other; and identifying patients with similar symptoms over time; to mention only a few. Similarity join operations may also be used for "data mining" operations.
A pair of points is considered "similar" if the distance between them is less than ε for some distance metric, where ε is a user-defined parameter. In this description, we use the $L_p$-norm as the distance metric, and it is defined as:
$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$
where p identifies the particular distance metric and d is the dimensionality of the points x and y, each of which is d-dimensional. $L_\infty$ is defined as the distance metric:
$$L_\infty(x, y) = \max_{i=1}^{d} |x_i - y_i|.$$
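As an illustration only, the distance test described above can be sketched in Python. The function names `lp_distance` and `naive_similarity_join` are hypothetical and do not come from this description. The brute-force pairwise loop is a reference baseline that compares every pair of points; it is not one of the indexed join procedures discussed in this document.

```python
import math
from itertools import combinations


def lp_distance(x, y, p=2.0):
    """L_p distance between two d-dimensional points; p=math.inf gives L_infinity."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # L_infinity: largest absolute difference along any single dimension.
        return max(diffs, default=0.0)
    return sum(t ** p for t in diffs) ** (1.0 / p)


def naive_similarity_join(points, eps, p=2.0):
    """Return index pairs (i, j), i < j, whose L_p distance is less than eps.

    Assumption: a brute-force scan over all pairs, used here only to make
    the definition of "similar" concrete.
    """
    return [
        (i, j)
        for (i, x), (j, y) in combinations(enumerate(points), 2)
        if lp_distance(x, y, p) < eps
    ]
```

For the points (1, 2) and (4, 6), this sketch gives a distance of 7 for p = 1, 5 for p = 2, and 4 for p = infinity.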
Note that if the number of dimensions (d) is 3 or less, the similarity join can be thought of as being of a spatial nature and the join can be called a "spatial similarity join". Note that $L_p$ is a class of distance metrics, where p identifies the particular metric. $L_1$ is conventionally referred to as the Manhattan distance and is derived from the sum of the distances along the orthogonal coordinate directions; $L_2$ is the Euclidean distance computed on the basis of a direct line between the two points of interest; and $L_\infty$ is another distance, computed as the maximum distance along any one of the dimensions. The distance metrics are conventionally known and not described further. Several data structures have been proposed for multidimensional similarity join, including the R-tree family (R-tree, R*-tree, R⁺-tree) [8, 20, 10, 6], grid-file [18], k-d-b tree [19, 7], SS-tree [23] and SR-tree [12] indices. However, these and other known data structures are generally not efficient for performing similarity joins on high-dimensional points because their time and space complexity increase rapidly with increasing dimensionality. For example, a data structure that may be usable for two- or three-dimensional points might typically be unusable for ten- or hundred-dimensional points.
An earlier database procedure involved the K-d (or Kd) tree, which was memory resident; later extensions or improvements resulted in the K-d-B (or KdB) versions for disk-resident implementations, that is, for implementations in which the database was too large to be entirely memory resident at one time. These earlier structures and implementations are known and not described further.
The ε-k-d-B tree has been proposed by Agrawal et al. [2] as a multidimensional index structure for performing similarity join on high-dimensional points, and is purported to be better than other conventional data structures for that purpose. In particular, it is purportedly faster than the R⁺-tree on some synthetic and real-life datasets. The ε-k-d-B tree index structure of Agrawal et al. [2] uses a static constant threshold for the leaf size. When the number of points in a node falls below the fixed leaf size threshold, further tree construction is stopped for that node and the node is declared to be a leaf node. Reference [2] is incorporated by reference in its entirety.
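The static leaf-size rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction procedure of reference [2]: the names `Node`, `build` and `leaf_size` are hypothetical, and the splitting scheme (one dimension per tree level, partitioned into stripes of width ε) is assumed for the purpose of the example. Only the stopping rule, a fixed threshold on the number of points in a node, is taken from the text.

```python
from dataclasses import dataclass, field
from math import floor


@dataclass
class Node:
    points: list                                   # populated only for leaf nodes
    children: dict = field(default_factory=dict)   # stripe index -> child Node

    @property
    def is_leaf(self):
        return not self.children


def build(points, eps, leaf_size, depth=0):
    """Recursively build a tree, stopping at a fixed leaf-size threshold."""
    d = len(points[0]) if points else 0
    # Static rule from the text: a node holding fewer points than the fixed
    # threshold becomes a leaf. Running out of dimensions also ends the split.
    if len(points) < leaf_size or depth >= d:
        return Node(points=list(points))
    # Assumption: split on the dimension equal to the current depth,
    # grouping points into stripes of width eps.
    buckets = {}
    for p in points:
        buckets.setdefault(floor(p[depth] / eps), []).append(p)
    node = Node(points=[])
    for stripe, bucket in buckets.items():
        node.children[stripe] = build(bucket, eps, leaf_size, depth + 1)
    return node
```

Because `leaf_size` is fixed before construction begins, the shape of the resulting tree depends on that constant as well as on the data.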
Therefore, while procedures for performing multi-dimensional similarity joins have evolved somewhat, there remains a need for efficient procedures, methods and structures for performing similarity joins on high-dimensional data sets. There also remains a need for methods and structures that are scalable so as to provide efficient joins for large data sets and a large number of processors. Furthermore, there remains a need for a better load-balancing metric that can partition the data set among a plurality of processors based on the characteristics of the data set itself, and that is not limited by a previously, statically determined tree characteristic or based solely on an equal distribution of points among processors. These and other problems with conventional structures and methods are solved by the invention, as will be apparent in light of the detailed description of the invention and accompanying drawings.