1. Field of the Invention
The present invention relates to a search apparatus and, in particular, to a similarity search apparatus that searches for a string made of units (hereinafter referred to as a unit string) whose distance from a search key, defined as the minimum number of editing operations, is not greater than a predetermined threshold value. Here, a unit is a character, a letter, a word or the like.
2. Description of the Prior Art
Conventionally, calculation of a minimum edit cost has been proposed as a measure of the similarity between two letter strings. For example, a first prior art document, R. A. Wagner et al., "The String-to-String Correction Problem", Journal of the ACM, Vol. 21, No. 1, pp. 168-172, 1974, proposes a method that defines replacement, deletion and insertion of a letter as editing operations and uses the number of editing operations required to make two letter strings identical as the measure of similarity (this method is referred to as a first prior art hereinafter). Consider, for example, the similarity between a letter string "abcd" and a letter string "ace". According to the first prior art method, the letter string "abcd" can be transformed into the letter string "ace" by deleting the letter "b" and replacing the letter "d" with "e", and therefore the similarity between the two letter strings is calculated as two.
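The calculation described for the first prior art can be illustrated by the following minimal sketch. It uses the standard dynamic-programming formulation with a cost of one per editing operation; the function name `edit_distance` and the unit cost are assumptions made for illustration and are not taken from the cited document.

```python
def edit_distance(source, target):
    """Minimum number of unit insertions, deletions and replacements
    needed to turn `source` into `target`. Works on any sequence of
    comparable units (letters of a str, words in a list, ...)."""
    # previous[j] = distance between the processed prefix of source
    # and the first j units of target.
    previous = list(range(len(target) + 1))
    for i, s_unit in enumerate(source, start=1):
        current = [i]  # deleting all i units of the source prefix
        for j, t_unit in enumerate(target, start=1):
            cost = 0 if s_unit == t_unit else 1
            current.append(min(
                previous[j] + 1,         # delete s_unit
                current[j - 1] + 1,      # insert t_unit
                previous[j - 1] + cost,  # replace s_unit, or keep it
            ))
        previous = current
    return previous[-1]


print(edit_distance("abcd", "ace"))  # 2: delete "b", replace "d" with "e"
```

Because the sketch only compares units for equality, the same function can be called with lists of words instead of letter strings.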
Likewise, a second prior art document, S. M. Selkow et al., "The Tree-to-Tree Editing Problem", Information Processing Letters, Vol. 6, No. 6, pp. 184-186, 1977, proposes a method for calculating the similarity between tree structures that defines replacement, deletion and insertion of a node as editing operations (this method is referred to as a second prior art hereinafter).
Further, a third prior art document, Sun Wu et al., "Fast Text Searching Allowing Errors", Communications of the ACM, Vol. 35, No. 10, pp. 83-91, October 1992, discloses a method for executing a similarity search with the minimum number of editing operations defined as the similarity and for applying it to letter correction or the like (this method is referred to as a third prior art hereinafter). According to the third prior art method, the search key is first divided equally into "predetermined threshold value + 1" partial search keys, and each partial search key is searched for on the basis of complete coincidence. For the similarity between a certain symbol string and the search key to be not greater than the predetermined threshold value, at least one of the partial search keys must be included in the symbol string. Therefore, a similarity search result can be obtained by executing the similarity calculation only in the vicinity of each location obtained by this search. In particular, because the whole process is implemented by bit processing, the method has the advantage of operating at relatively high speed.
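The partitioning step of the third prior art can be sketched as follows. This is only an illustration of the idea as described above, not the bit-processing implementation of the cited document: it uses ordinary Python string search, and all function names (`split_key`, `best_substring_distance`, `similarity_search`) and the choice to report candidate text regions are assumptions.

```python
def split_key(key, threshold):
    """Divide key into threshold + 1 nearly equal, non-empty pieces.
    Returns (offset_in_key, piece) pairs."""
    parts = threshold + 1
    if parts > len(key):
        raise ValueError("search key too short for this threshold")
    base, extra = divmod(len(key), parts)
    pieces, start = [], 0
    for p in range(parts):
        length = base + (1 if p < extra else 0)
        pieces.append((start, key[start:start + length]))
        start += length
    return pieces


def best_substring_distance(key, window):
    """Smallest edit distance between key and any substring of window."""
    previous = [0] * (len(window) + 1)  # a match may start anywhere
    for i, k_unit in enumerate(key, start=1):
        current = [i]
        for j, w_unit in enumerate(window, start=1):
            cost = 0 if k_unit == w_unit else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return min(previous)  # a match may end anywhere


def similarity_search(text, key, threshold):
    """Return (start, end) text regions containing a substring whose
    edit distance to key is not greater than threshold."""
    regions = set()
    for offset, piece in split_key(key, threshold):
        pos = text.find(piece)  # search by complete coincidence
        while pos != -1:
            aligned = pos - offset  # where the key would start
            lo = max(0, aligned - threshold)
            hi = min(len(text), aligned + len(key) + threshold)
            # Similarity is calculated only near the exact hit.
            if best_substring_distance(key, text[lo:hi]) <= threshold:
                regions.add((lo, hi))
            pos = text.find(piece, pos + 1)
    return sorted(regions)


print(similarity_search("zzabxdefzz", "abcdef", 1))
```

With a threshold of one, the key "abcdef" is divided into "abc" and "def"; a single editing operation can disturb at most one of the two pieces, so the other must still appear unchanged in any matching text.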
However, since the third prior art executes comparison by bit processing, the letters that can constitute a search key must be limited in advance. In the third prior art document, the search key is limited to the alphabet. Therefore, it is very difficult to apply this method to a similarity search of Japanese text, which has a large number of character types, or to a similarity search of a word string executed with words defined as the units of editing. Furthermore, although under certain conditions the result can be determined without executing the comparison up to the end of the search key, the third prior art method has the problem that no result can be obtained until the calculation has been executed to the end of the search key.
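One condition under which a result can be known before the end of the search key is reached can be illustrated as follows. The sketch is an assumption made for illustration and is not drawn from any of the cited documents: it relies on the fact that the smallest value in a row of the edit-distance table never decreases from one row to the next, so once every value in a row exceeds the threshold the candidate can be rejected at once. The function name `within_threshold` is hypothetical.

```python
def within_threshold(key, candidate, threshold):
    """Return True if the edit distance between key and candidate is
    not greater than threshold. Units may be letters or words."""
    previous = list(range(len(candidate) + 1))
    for i, k_unit in enumerate(key, start=1):
        current = [i]
        for j, c_unit in enumerate(candidate, start=1):
            cost = 0 if k_unit == c_unit else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        if min(current) > threshold:
            # Later rows can only be equal or larger, so the remaining
            # units of the key need not be compared.
            return False
        previous = current
    return previous[-1] <= threshold


print(within_threshold("abcd", "ace", 2))
print(within_threshold(["similarity", "search"], ["similar", "search"], 1))
```

The second call treats whole words as the units of editing, which a comparison restricted to a fixed alphabet cannot express directly.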