A string similarity join is a technique for detecting all pairs of an element s in an element set S and an element r in an element set R such that the distance between the strings contained in the elements of each pair satisfies a threshold condition. For measuring the distance between strings, there exist various distance scales having different characteristics, such as the Jaccard index, cosine similarity, and edit distance.
The edit distance represents the minimum number of operations (inserting, deleting, or replacing a letter) necessary for converting one string into another string. For example, the edit distance between the two strings “kitten” and “sitting” is the minimum number of such operations necessary to convert “kitten” into “sitting” (or “sitting” into “kitten”). In this case, the string “kitten” can be converted into the string “sitting” by replacing “k” with “s,” replacing “e” with “i,” and inserting “g.” Thus, the edit distance between the string “kitten” and the string “sitting” is three (two replacements and one insertion).
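The definition above can be illustrated with a minimal sketch of the standard dynamic-programming computation of the edit distance. The function name edit_distance is chosen here for illustration only and does not appear in the cited documents.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements
    needed to convert string a into string b."""
    # prev[j] is the distance between the already processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting i characters of a into the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The running time of this sketch is proportional to the product of the two string lengths, which is why the length of the join keys matters in the discussion that follows.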
Hereinafter, the string similarity join is also simply referred to as a string join or join. Further, a tuple set serving as an input of the string similarity join is also referred to as data or input data. Each tuple set contains at least one tuple. The tuple is formed by plural attribute values. The tuple contained in the input data contains at least one string as an attribute value. Hereinafter, an attribute having a string set thereto as the attribute value is also referred to as a string attribute. The string attribute used as a key in the string similarity join is referred to as a join key attribute, and a value of the join key attribute is referred to as a join key or join key string.
FIG. 20 is a schematic view illustrating an example of the string similarity join employing the edit distance. In the example illustrated in FIG. 20, the tuple sets S and R serve as the input data. The input data S and R each have a string attribute “product number,” and the string attribute “product number” is used as the join key attribute. The string similarity join detects all pairs of a tuple s and a tuple r that satisfy a condition in which the edit distance between a join key of the tuple s contained in the input data S and a join key of the tuple r contained in the input data R is less than or equal to a predetermined threshold value (for example, two).
Hereinafter, the edit distance between the join key of the tuple s and the join key of the tuple r is also referred to as an edit distance between the tuples s and r, an edit distance of a tuple pair (s, r), or an edit distance between a tuple s and a tuple r. Further, in the case where an edit distance of a certain tuple pair is less than or equal to a predetermined threshold value τ, the tuples s and r of this pair are referred to as “having similarity.”
In the lower left table of FIG. 20, the output results of the string similarity join process are shown. In the example illustrated in FIG. 20, four pairs having an edit distance less than or equal to 2 are outputted. Each of the tuples of an outputted pair is indicated with a tuple pointer formed by a tuple identifier for identifying the tuple and a data identifier for identifying the tuple set (data) containing this tuple. The tuple identifier is the value of an attribute TID. For example, the first row of the lower left table in FIG. 20 shows a pair of a tuple s indicated with a tuple pointer (S:101) and a tuple r indicated with a tuple pointer (R:201). Here, the tuple s is the tuple having a value of the attribute TID of 101 in the input data S, and the tuple r is the tuple having a value of the attribute TID of 201 in the input data R.
The lower right table in FIG. 20 shows an integrated state in which the tuple s and the tuple r of each pair contained in the results of the string similarity join process shown in the lower left table are integrated into one tuple.
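The join described above can be sketched as a brute-force comparison of every tuple pair. The following is only an illustration under stated assumptions: the tuple values are hypothetical and are not the contents of FIG. 20, tuples are modeled as dictionaries, and the function names similarity_join and integrate are introduced here for illustration.

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_join(S, R, key, tau):
    """Return tuple-pointer pairs whose join keys are within tau edits."""
    pairs = []
    for s in S:
        for r in R:
            if edit_distance(s[key], r[key]) <= tau:
                # A tuple pointer is "<data identifier>:<tuple identifier>".
                pairs.append((f"S:{s['TID']}", f"R:{r['TID']}"))
    return pairs


def integrate(pairs, S, R):
    """Merge the two tuples of each pair into one tuple (a dict here)."""
    by_ptr = {f"S:{t['TID']}": t for t in S}
    by_ptr.update({f"R:{t['TID']}": t for t in R})
    merged = []
    for sp, rp in pairs:
        row = {f"S.{k}": v for k, v in by_ptr[sp].items()}
        row.update({f"R.{k}": v for k, v in by_ptr[rp].items()})
        merged.append(row)
    return merged


# Hypothetical input data; "product number" is the join key attribute.
S = [{"TID": 101, "product number": "AX-1001"},
     {"TID": 102, "product number": "BZ-2040"}]
R = [{"TID": 201, "product number": "AX-1010"},
     {"TID": 202, "product number": "QQ-9999"}]

pairs = similarity_join(S, R, "product number", tau=2)
print(pairs)                    # [('S:101', 'R:201')]
print(integrate(pairs, S, R))
```

Because every tuple of S is compared with every tuple of R, the number of edit distance calculations in this sketch equals the product of the sizes of the two tuple sets.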
Methods of the string similarity join employing such an edit distance are proposed, for example, in Non-patent Documents 1 to 4 below. These methods employ different approaches according to the average string length of the input data serving as a target. Here, the average string length of the input data means the average of the lengths (numbers of characters) of the strings serving as the join keys of the input tuples. Thus, input data whose join keys are short on average is said to have a short average string length.
The methods proposed in Non-patent Documents 1 to 3 target input data having a relatively long average string length, such as text. In general, calculating the edit distance between long strings takes a long time. Thus, in the case where data having a long average string length is targeted, the time required for the string join process increases. In view of the facts described above, the methods proposed in Non-patent Documents 1 to 3 convert each join key into a signature in the form of a short bit stream, calculate a distance (or degree of similarity) between the signatures, and retain only the pairs of tuples that are highly likely to have similarity (filtering). Then, by calculating edit distances only for the filtered pairs from among all the pairs of input tuples (refining), it is possible to increase the speed of the string similarity join process.
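The filter-and-refine approach can be sketched as follows. This is not the signature scheme of Non-patent Documents 1 to 3; it is a simple stand-in chosen for illustration, in which the signature is a 64-bit mask of the characters occurring in the join key. Since one insertion, deletion, or replacement changes the set of occurring characters by at most two, a pair whose signatures differ in more than 2τ bits cannot have an edit distance of τ or less and can be discarded without calculating the edit distance. The length check is an additional assumption of this sketch.

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def signature(s, bits=64):
    """Bit mask with one bit set per character code (modulo bits) in s."""
    sig = 0
    for ch in s:
        sig |= 1 << (ord(ch) % bits)
    return sig


def may_be_similar(a, sig_a, b, sig_b, tau):
    """Cheap necessary conditions for edit_distance(a, b) <= tau."""
    if abs(len(a) - len(b)) > tau:
        return False  # each edit changes the length by at most one
    # Each edit flips at most two signature bits.
    return bin(sig_a ^ sig_b).count("1") <= 2 * tau


def filter_and_refine_join(S_keys, R_keys, tau):
    """Return (pairs within tau, number of pairs that reached refinement)."""
    r_sigs = [(r, signature(r)) for r in R_keys]
    result, refined = [], 0
    for s in S_keys:
        sig_s = signature(s)
        for r, sig_r in r_sigs:
            if not may_be_similar(s, sig_s, r, sig_r, tau):
                continue                      # filtering
            refined += 1
            if edit_distance(s, r) <= tau:    # refining
                result.append((s, r))
    return result, refined


S_keys = ["kitten", "sitting", "mitten", "abcdef"]
R_keys = ["sitting", "kitchen", "zzzzzz"]
pairs, refined = filter_and_refine_join(S_keys, R_keys, tau=2)
print(pairs, refined, "of", len(S_keys) * len(R_keys))
```

The filter never discards a pair that has similarity, so the result is the same as that of a comparison of all pairs; only the number of edit distance calculations decreases.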
Non-patent Document 4 proposes an approach different from the filter-and-refine approach, and targets data having a relatively short average string length. The method proposed in Non-patent Document 4 first stores all the join keys of the input data S and R in one trie. A trie is a tree data structure that expresses plural strings in a compressed manner by sharing their common prefixes, and is frequently used as an index for strings. In general, a trie that stores a set of short strings can be searched in a relatively short period of time. The method proposed in Non-patent Document 4 searches the trie that stores all the join keys and calculates the edit distances between the join keys, thereby performing the join for data having a relatively short average string length at a relatively high speed.
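A trie-based search can be sketched as follows. This is a simplified illustration and not the algorithm of Non-patent Document 4: all join keys of S and R are stored in one trie, and each join key of S is then used as a query, with one row of the edit distance table computed per trie node so that keys sharing a prefix share the computation. The names TrieNode, insert, search, and trie_join and the sample strings are assumptions of this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.pointers = []   # tuple pointers of the keys ending at this node


def insert(root, key, pointer):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.pointers.append(pointer)


def search(root, query, tau):
    """Return (pointer, distance) for every stored key within tau of query."""
    results = []

    def walk(node, row):
        # row[j] is the edit distance between the prefix spelled by the
        # path to this node and the first j characters of query.
        if node.pointers and row[-1] <= tau:
            results.extend((p, row[-1]) for p in node.pointers)
        for ch, child in node.children.items():
            next_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                next_row.append(min(next_row[j - 1] + 1,
                                    row[j] + 1,
                                    row[j - 1] + cost))
            # If every entry exceeds tau, no key below this child can match.
            if min(next_row) <= tau:
                walk(child, next_row)

    walk(root, list(range(len(query) + 1)))
    return results


def trie_join(S, R, tau):
    """S and R map tuple identifiers to join keys."""
    root = TrieNode()
    for tid, key in S.items():
        insert(root, key, f"S:{tid}")
    for tid, key in R.items():
        insert(root, key, f"R:{tid}")
    pairs = []
    for tid, key in S.items():
        for pointer, dist in search(root, key, tau):
            if pointer.startswith("R:"):
                pairs.append((f"S:{tid}", pointer, dist))
    return pairs


S = {101: "kitten"}
R = {201: "sitting", 202: "mitten", 203: "dog"}
print(sorted(trie_join(S, R, tau=3)))
```

The pruning step relies on the fact that the minimum entry of a row never decreases as the path is extended, so a subtree can be skipped as soon as the whole row exceeds the threshold value.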
As described above, in the string similarity join, the edit distances are calculated for the pairs of tuples in the input data S and the input data R, and hence, the time required for the processing increases with an increase in the data volume of the input data S and the input data R. In view of the facts described above, Non-patent Documents 5 and 6 propose methods of processing the string similarity join in parallel to reduce the time required for the entire processing. The method proposed in Non-patent Document 5 employs the filter-and-refine approach in a parallel manner, and is suitable for data having a long average string length. The method proposed in Non-patent Document 6 employs a distance scale different from the edit distance, and performs the parallel processing for the string similarity join using characteristics of that distance scale.
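The general idea of parallel processing can be sketched by dividing the input data S into chunks and joining each chunk with the input data R independently. This is only an illustration of data partitioning and does not reproduce the methods of Non-patent Documents 5 and 6; the names join_chunk and parallel_join and the sample strings are assumptions. A thread pool is used to keep the sketch self-contained; in CPython, threads do not speed up this kind of computation, and a process pool or multiple machines would be used for an actual reduction in processing time.

```python
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def join_chunk(chunk, R, tau):
    """Join one chunk of S with all of R."""
    return [(s, r) for s in chunk for r in R if edit_distance(s, r) <= tau]


def parallel_join(S, R, tau, workers=4):
    """Split S into at most `workers` chunks and join them concurrently."""
    size = max(1, -(-len(S) // workers))  # ceiling division
    chunks = [S[i:i + size] for i in range(0, len(S), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda chunk: join_chunk(chunk, R, tau), chunks)
        # map() yields the partial results in chunk order.
        return [pair for part in parts for pair in part]


S = ["kitten", "mitten", "sitting", "abcdef", "fitting"]
R = ["sitting", "kitchen", "bitten"]
print(parallel_join(S, R, tau=2))
```

Since each chunk is joined independently, the partial results can simply be concatenated, and the output equals that of the sequential join.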