1. Field of the Invention
The present invention relates to a high dimensional similarity join method, and more particularly, to a partition-based high dimensional similarity join method that allows similarity to be measured efficiently by dynamically selecting, in advance, the space partitioning dimensions and the number of partitioning dimensions using a dimension selection algorithm.
2. Description of the Prior Art
In general, multimedia data such as audio, video, images and text, time-series data indicating a sequence over a period of time, and a large amount of business data used for various data warehouses first go through preprocessing procedures and are then mapped to points on a high dimensional space for search, management and the like, as shown in FIG. 6. The similarity between the mapped data is measured based on the Euclidean distance between data in the high dimensional space. For example, the similarity between two image files is measured based on the distance between two points mapped onto the high dimensional space.
The term ‘similarity join’ is defined as a method of efficiently retrieving similar data among data sets when high dimensional data are provided as input data from huge multimedia databases, medical databases, scientific databases, time-series databases and the like, and the similarity join is indispensably required in high dimensional data systems such as image and multimedia data systems, time-series data systems and the like.
The similarity join can be modeled as follows.
Assuming that data sets R and S exist in a d-dimensional space and that arbitrary elements r and s of the data sets R and S are represented as r=[r1, r2, . . . , rd] and s=[s1, s2, . . . , sd], respectively, a similarity join query can be formulated as follows:
R \times S = \left\{ (r, s) \;\middle|\; \left( \sum_{i=1}^{d} \lvert r_i - s_i \rvert^{p} \right)^{1/p} \le \varepsilon,\ r \in R,\ s \in S \right\} \qquad (1)
where p specifies the distance metric (for example, p=2 gives the Euclidean distance), ε is a user-defined cutoff similarity value, and only those data pairs whose spatial distance does not exceed ε, among all pairs consisting of elements of the data sets R and S, are returned as results.
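As an illustration of Equation (1) only, the following minimal Python sketch evaluates the similarity join by comparing every pair of points. The function names `lp_distance` and `similarity_join` and the sample data are assumptions introduced here, not part of the described methods, and the nested-loop strategy is the naive baseline rather than any of the algorithms discussed in this section.

```python
from itertools import product

def lp_distance(r, s, p=2.0):
    """Minkowski (L_p) distance between two points of equal dimension."""
    return sum(abs(ri - si) ** p for ri, si in zip(r, s)) ** (1.0 / p)

def similarity_join(R, S, eps, p=2.0):
    """Return every pair (r, s) whose L_p distance is at most eps.

    Naive nested-loop evaluation: |R| * |S| distance computations.
    """
    return [(r, s) for r, s in product(R, S) if lp_distance(r, s, p) <= eps]

# Hypothetical two-dimensional sample data.
R = [(0.0, 0.0), (1.0, 1.0)]
S = [(0.0, 0.5), (3.0, 3.0)]
pairs = similarity_join(R, S, eps=1.0)
print(pairs)
```

Because every pair is examined, the cost grows with the product of the two data set sizes, which is what the partition-based and grid-based methods described in this section try to avoid.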
Conventional similarity join methods work well for low dimensional data but are very inefficient, in terms of execution time and system storage requirements, for high dimensional data having a very large number of dimensions, e.g. 10, 100 or even 1000 dimensions.
Typical examples of conventional similarity join methods include a similarity join method based on the ε-kdB trees (“High dimensional similarity joins” by K. Shim, R. Srikant and R. Agrawal, Proceedings of the 1997 IEEE International Conference on Data Engineering, 1997) and a similarity join method using the ε-grid order (“ε-grid order: An algorithm for the similarity join on massive high dimensional data” by C. Böhm, B. Braunmüller, F. Krebs, and H.-P. Kriegel, Proceedings of the 2001 ACM-SIGMOD Conference, 2001).
In the similarity join method based on the ε-kdB trees, a data space is divided into cells having a width of ε along one dimension axis, data are stored in the cells, and ε-kdB trees having multi-dimensional index structures are constructed for the respective cells. This method can efficiently reduce the number of joins by limiting the partitioning width for the data division to units of ε, so that a cell needs to be joined only with itself and its adjacent cells. However, since the ε-kdB tree structures representing the respective partitions must be held in the system storage, the required system storage increases as the number of space dimensions increases. As a result, the time required for performing the similarity joins also increases proportionally.
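The effect of partitioning one axis in units of ε can be sketched as follows. This is a simplified illustration under stated assumptions and not the ε-kdB tree itself: it uses a plain dictionary of stripes instead of a tree index, and the function name `stripe_join` and the `axis` parameter are hypothetical. It relies on the fact that two points within L_p distance ε of each other differ by at most ε in every single coordinate, so they fall in the same stripe or in adjacent stripes.

```python
import math
from collections import defaultdict

def stripe_join(R, S, eps, axis=0, p=2.0):
    """Similarity join that compares only points in the same or adjacent stripes.

    The chosen axis is cut into stripes of width eps. Sketch only: it ignores
    floating-point edge cases exactly at stripe boundaries.
    """
    stripes = defaultdict(list)
    for s in S:
        stripes[math.floor(s[axis] / eps)].append(s)

    result = []
    for r in R:
        k = math.floor(r[axis] / eps)
        # A join partner of r can only lie in stripe k - 1, k or k + 1.
        for neighbor in (k - 1, k, k + 1):
            for s in stripes.get(neighbor, ()):
                dist = sum(abs(a - b) ** p for a, b in zip(r, s)) ** (1.0 / p)
                if dist <= eps:
                    result.append((r, s))
    return result

# Hypothetical sample data.
R = [(0.0, 0.0), (1.0, 1.0), (2.5, 0.1)]
S = [(0.0, 0.5), (3.0, 3.0), (2.0, 0.0)]
print(stripe_join(R, S, eps=1.0))
```

Points in stripes that are two or more apart are never compared, which is where the reduction in the number of joins comes from.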
In addition, in the algorithm for performing the similarity join using the ε-grid order, the similarity join for the high dimensional data is performed based on a special ordering of the data, which is obtained by laying a grid having a cell length of ε over the data space and then comparing the grid cells in lexicographical order. Unlike the method using ε-kdB trees, this algorithm scales efficiently to very massive data sets even with limited storage. However, there is a disadvantage in that, since all points between p − [ε, ε, …, ε] and p + [ε, ε, …, ε] must be considered in order to search for the join pairs of a point p, as shown in FIG. 7, the number of grid cells searched in this interval becomes very large as the number of dimensions increases, resulting in an increased execution time.
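The growth in the number of searched grid cells can be made concrete with a small sketch. It assumes a grid of half-open cells of length ε aligned at the origin and simply enumerates the cells that overlap the box from p − [ε, …, ε] to p + [ε, …, ε]. The function name `cells_overlapping_eps_box` is hypothetical, and the sketch does not reproduce the ε-grid order algorithm itself. Under this assumption the box overlaps at most three cells along each axis, so up to 3^d cells in d dimensions.

```python
import math
from itertools import product

def cells_overlapping_eps_box(point, eps):
    """Index tuples of the grid cells overlapping [point - eps, point + eps]."""
    ranges = [
        range(math.floor((x - eps) / eps), math.floor((x + eps) / eps) + 1)
        for x in point
    ]
    return list(product(*ranges))

# Hypothetical query points in 2, 3 and 8 dimensions.
for d in (2, 3, 8):
    cells = cells_overlapping_eps_box((0.5,) * d, eps=1.0)
    print(d, len(cells))
```

Enumerating the cells explicitly is only feasible for small d; for 32 dimensions the same bound is 3^32, which is roughly 1.85 × 10^15 cells.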
Meanwhile, although space partitioning methods used in low dimensional space data systems may be applicable to similarity joins in a high dimensional data space, they are not desirable from a practical point of view because they require space partitioning along all the dimension axes. In other words, the number of cells that result from partitioning explodes as the number of dimension axes participating in the partitioning increases (for example, if each dimension axis is divided into 10 continuous sub-intervals, the numbers of cells generated for 8, 16, 32 and 64 dimensions are 10^8, 10^16, 10^32 and 10^64, respectively), so these numbers are usually larger than the number of points in the original data sets before partitioning. As the number of cells that result from partitioning grows, the data skew phenomenon becomes severe, and the algorithm based on space partitioning itself becomes inefficient. Herein, the data skew phenomenon means that when a high dimensional space is divided into cells, the data distribution among the cells is not uniform, as shown in FIG. 8.
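The arithmetic behind these cell counts can be checked with a few lines of Python. The data set size `n_points` is a hypothetical figure chosen only for illustration. Since a point lies in exactly one cell, at most `n_points` cells can be non-empty, which bounds the share of cells that hold any data at all.

```python
def cell_count(intervals_per_axis, dims):
    """Number of cells when each of `dims` axes is split into equal sub-intervals."""
    return intervals_per_axis ** dims

n_points = 1_000_000  # hypothetical data set size, for illustration only

for dims in (8, 16, 32, 64):
    cells = cell_count(10, dims)
    # At most n_points cells can contain data, so this bounds the non-empty share.
    max_nonempty_share = min(1.0, n_points / cells)
    print(f"d={dims}: 10^{dims} cells, non-empty share <= {max_nonempty_share:.0e}")
```

With these assumed numbers, almost every cell is empty from 8 dimensions onward, which is one way to see why partitioning along all axes is impractical.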
Therefore, there is a need for a new similarity join method in which similarity joins for high dimensional data can be performed efficiently within a short execution time and without requiring massive storage space while the similarity joins are performed.