1. Field of the Invention
The present invention relates to a method and apparatus for searching a text for a specific character string and a recording medium recording a search program and, more particularly, to a symbol string search method and apparatus useful for an approximate matching (approximate search) technique of searching for similar symbol strings and a recording medium recording a symbol string search program.
2. Description of the Prior Art
Techniques for searching a large, one-dimensional, unstructured symbol string for a specific target symbol string are used in various fields such as document search and gene sequence search. In practice, it is required not only to increase the speed of search but also to develop an approximate matching technique that finds symbol strings containing a certain amount of mismatch as well as perfectly matched symbol strings. For example, techniques based on the principle of the finite automaton have been proposed in Japanese Unexamined Patent Publication No. 2-76072 entitled "Variable Length Character String Detection Apparatus", Japanese Unexamined Patent Publication No. 3-131969 entitled "Symbol String Search Method and Apparatus", and Japanese Unexamined Patent Publication No. 8-241335 entitled "Fuzzy Character String Search Method and System using Fuzzy Nondeterministic Automaton". These techniques will be collectively referred to as the prior art 1 hereinafter.
Separately, a technique has been proposed that performs approximate matching at high speed by sequential operations combining a simple bit shift operation and bit logic operations (Wu et al., Fast String Matching Allowing Errors, Communications of the ACM, Vol. 35, No. 10, pp. 88-91). This technique will be referred to as the prior art 2 hereinafter and is briefly described below.
Assume that a target symbol string of m symbols is to be searched for in a symbol string of n symbols while allowing at most k symbol insertions, deletions, or replacements. Insertion is the addition of an unnecessary symbol; e.g., 'abccde' has one symbol insertion with respect to 'abcde'. Deletion is the loss of a symbol; e.g., 'abde' has one symbol deletion with respect to 'abcde'. Replacement is the substitution of one symbol for another; e.g., 'abade' has one symbol replacement with respect to 'abcde'.
Let the symbol string to be searched be T = t[1]t[2]t[3]...t[n] and the target symbol string be P = p[1]p[2]...p[m]. Search is executed by sequentially calculating a two-dimensional array of matching state bit strings R[i,j] (0 ≤ i ≤ n, 0 ≤ j ≤ k), each composed of m bits, in increasing order of i and j. R[i,j] indicates "whether matching is successful at the position of the ith symbol in the symbol string to be searched when at most j insertions, deletions, or replacements are allowed". If the xth bit from the start bit of R[i,j] is 1, the first x symbols of the target symbol string are matched at the position of the ith symbol in the symbol string to be searched with at most j mismatches. Note that the leftmost bit is the start bit of R[i,j]. For example, if R[10,1] = '11001' during the processing, then at the position of the 10th symbol in the symbol string to be searched, matching with at most one mismatch is successful up to the first, second, and fifth symbols of the target symbol string. The length of the target symbol string is m. Therefore, if the mth bit of R[i,j] is 1, the target symbol string is detected at the position of the ith symbol in the symbol string to be searched with at most j mismatches. The recurrence formulas for R[i,j] are
R[i,j] = B(j) (for i = 0)  (1-1)
R[i,j] = Sft(R[i-1,0]) AND Msk(t[i]) (for i > 0, j = 0)  (1-2)
R[i,j] = (Sft(R[i-1,j]) AND Msk(t[i])) OR R[i-1,j-1] OR Sft(R[i-1,j-1]) OR Sft(R[i,j-1]) (for i > 0, j > 0)  (1-3)
B(j) is an m-bit string in which the first j bits are 1s and the other bits are 0s. Sft(R) is an operation of shifting a bit string R to the right by one bit (1 is set in the emptied start bit). For example, Sft('10010') = '11001' for the five-bit string '10010'.
Msk(c) is an m-bit string in which each position where the symbol c occurs in the target symbol string is 1 and the other bits are 0s. The bits of Msk(c) correspond to the symbols of the target symbol string in order from the start bit (leftmost bit). If the symbol c does not occur in the target symbol string, every bit in Msk(c) is 0. If the symbol c occurs a plurality of times in the target symbol string, two or more bits in Msk(c) are 1s. For example, Msk('a') = '10100', Msk('b') = '01010', Msk('c') = '00001', and Msk('d') = '00000' for a target symbol string 'ababc'. "AND" is a bitwise logical product, and "OR" is a bitwise logical sum.
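The definitions of B(j), Sft(R), and Msk(c) given above can be illustrated by a minimal Python sketch. The function names `b`, `sft`, and `msk` and the representation of bit strings as text strings of '0' and '1' (leftmost character as the start bit) are assumptions chosen for readability; they are not part of the prior art 2 itself.

```python
def b(j: int, m: int) -> str:
    """B(j): m-bit string whose first j bits are 1 and whose other bits are 0."""
    return "1" * j + "0" * (m - j)


def sft(bits: str) -> str:
    """Sft(R): shift right by one bit; a 1 enters at the start (leftmost) bit."""
    return "1" + bits[:-1]


def msk(symbol: str, target: str) -> str:
    """Msk(c): 1 at every position where `symbol` occurs in `target`, else 0."""
    return "".join("1" if p == symbol else "0" for p in target)


if __name__ == "__main__":
    print(sft("10010"))                          # example from the text
    print([msk(c, "ababc") for c in "abcd"])     # masks for target 'ababc'
```

A practical implementation would normally hold these bit strings in machine words so that each operation costs a single instruction; the string form is used here only to mirror the notation of the description.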
The prior art 2 can perform search allowing k or fewer errors (insertions, deletions, or replacements) by scanning the symbol string to be searched once in accordance with the above recurrence formulas. The total calculation amount is the sum of the calculation amount O(nk) for R[i,j] and the initialization time for forming Msk(c). The notable feature of this prior art 2 is that the processing time is only proportional to the length of the symbol string to be searched and to the number of allowable mismatched symbols, even though fuzzy collation is allowed. Also, high-speed processing is possible because the operation amount of each recurrence formula is very small.
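As an illustration, the single scan governed by formulas (1-1) to (1-3) might be sketched in Python as follows. This is a sketch under stated assumptions, not a definitive implementation: the function name `approx_search` is hypothetical, the unshifted term of formula (1-3) is taken to be R[i-1,j-1], and each bit string is held in an integer whose least significant bit plays the role of the start bit, so that Sft becomes a left shift that sets the lowest bit.

```python
def approx_search(text: str, pattern: str, k: int) -> list[int]:
    """Return the 1-based end positions i at which `pattern` is detected in
    `text` with at most k insertions, deletions, or replacements."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    full = (1 << m) - 1          # keeps every bit string m bits wide
    accept = 1 << (m - 1)        # the mth bit: whole pattern matched

    # Msk(c): bit x-1 is set when the xth pattern symbol equals c.
    mask: dict[str, int] = {}
    for x, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << x)

    def sft(r: int) -> int:
        # Assumed integer form of Sft: shift toward higher bits, set start bit.
        return ((r << 1) | 1) & full

    # (1-1): R[0, j] = B(j), the first j bits set.
    prev = [((1 << j) - 1) & full for j in range(k + 1)]
    hits = []
    for i, c in enumerate(text, start=1):
        msk_c = mask.get(c, 0)
        cur = [0] * (k + 1)
        cur[0] = sft(prev[0]) & msk_c                     # (1-2)
        for j in range(1, k + 1):                         # (1-3)
            cur[j] = ((sft(prev[j]) & msk_c)
                      | prev[j - 1]
                      | sft(prev[j - 1])
                      | sft(cur[j - 1]))
        if cur[k] & accept:
            hits.append(i)
        prev = cur
    return hits
```

For example, `approx_search("xxabdexx", "abcde", 1)` reports position 6, where 'abde' ends with one symbol deleted, while the same call with k = 0 reports nothing. Only the previous row of R is kept, since each formula refers to positions i and i-1 alone.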
One technique of increasing the speed of search of a large amount of symbol strings to be searched is a method using transposition information of a symbol string. This will be referred to as the prior art 3. In this method, a symbol string to be searched itself is not an object of search, and information (transposition information) indicating the position of each symbol in a symbol string to be searched is used in search. During search, only pieces of transposition information corresponding to symbols contained in a target symbol string are acquired from all pieces of transposition information. Symbol string search is performed by checking the position consistency between these acquired pieces of transposition information.
Generally, when the number of types of symbols forming a symbol string to be searched is large and the number of types of symbols contained in a target symbol string is small, the amount of acquired pieces of transposition information is very small compared to that of the original symbol string to be searched. For example, when a very long symbol string composed of several thousands of different symbols such as a Japanese text is searched for a word of a few characters, the amount of acquired pieces of transposition information is far smaller than that of the original document. This yields the following advantages in search using transposition information.
First, compared to search using the whole symbol string to be searched, the amount of data to be acquired for search is small, so the data transfer amount is reduced. This increases the processing speed. Especially when the symbol string to be searched is so large that it must be placed in a low-speed secondary storage device, a large increase in the processing speed can be expected. Second, search based upon perfect match between symbol strings can be performed by looking up each acquired piece of transposition information at most once. Therefore, the search can be executed within a shorter time period than when the whole symbol string to be searched is scanned.
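The perfect-match search of prior art 3 described above can be sketched roughly as follows. The function names `build_transposition_info` and `exact_search`, the dictionary layout (symbol mapped to a list of 1-based positions), and the use of set intersection for the position consistency check are all assumptions made for illustration; the description does not specify a data layout.

```python
from collections import defaultdict


def build_transposition_info(text: str) -> dict[str, list[int]]:
    """Record, for each symbol, the 1-based positions where it occurs."""
    info = defaultdict(list)
    for pos, symbol in enumerate(text, start=1):
        info[symbol].append(pos)
    return dict(info)


def exact_search(info: dict[str, list[int]], target: str) -> list[int]:
    """Return the 1-based start positions of perfect matches of `target`.

    Only the position lists of symbols contained in `target` are read;
    the original symbol string is never scanned.
    """
    if not target:
        return []
    # Candidate start positions come from the first target symbol.
    candidates = set(info.get(target[0], []))
    for offset, symbol in enumerate(target[1:], start=1):
        # A start position s survives only if `symbol` occurs at s + offset.
        candidates &= {p - offset for p in info.get(symbol, [])}
    return sorted(candidates)


if __name__ == "__main__":
    info = build_transposition_info("abcabcab")
    print(exact_search(info, "cab"))
```

When the text contains many distinct symbols and the target only a few, the lists read by `exact_search` are a small fraction of the stored information, which is the source of the reduced data transfer noted above.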
The problem of prior arts 1 and 2 described above is that the methods are based upon the assumption that the symbol string to be searched is scanned as a one-dimensional string, so a very long search time is necessary if the symbol string to be searched is enormous. Especially when the symbol string to be searched is so large that it must be placed in a secondary storage of a computer, this symbol string must be entirely transferred to a processing memory to perform search. Additionally, the whole transferred symbol string must be scanned at least once.
The problem of prior art 3 is that approximate matching for a symbol string cannot be performed efficiently because the method is based only on perfect match between symbol strings. To apply the method to approximate matching, it is possible to modify the target symbol string to form every symbol string regarded as a match in approximate matching, repeatedly perform perfect match search by using these symbol strings, and combine the results. To search for a target symbol string 'abcde', for example, the five symbol strings 'bcde', 'acde', 'abde', 'abce', and 'abcd' must be allowed as one-symbol deletions. In addition to these five symbol strings, symbol strings obtained by insertion and replacement, and all possible combinations of symbol insertion, deletion, and replacement up to the desired number of times, must be taken into consideration. Search is repeated for each of these symbol strings, and the results are combined. Unfortunately, this processing produces an enormous number of symbol strings and hence is very time-consuming. The number of symbol strings can be reduced somewhat by introducing a regular expression with a wild card that matches any symbol (e.g., in 'ab*cd', "*" matches an arbitrary character). However, a large increase in processing time is still unavoidable.
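The growth in the number of symbol strings can be made concrete with a short Python sketch that enumerates every string reachable from a target by a single deletion, insertion, or replacement. The function name `one_edit_variants` and the choice of a 26-letter alphabet are assumptions for illustration only; a Japanese text with several thousand symbol types would yield far more variants.

```python
import string


def one_edit_variants(target: str, alphabet: str):
    """Return the sets of strings obtained from `target` by exactly one
    deletion, one insertion, or one replacement."""
    n = len(target)
    deletions = {target[:i] + target[i + 1:] for i in range(n)}
    insertions = {target[:i] + c + target[i:]
                  for i in range(n + 1) for c in alphabet}
    replacements = {target[:i] + c + target[i + 1:]
                    for i in range(n) for c in alphabet if c != target[i]}
    return deletions, insertions, replacements


if __name__ == "__main__":
    dels, ins, reps = one_edit_variants("abcde", string.ascii_lowercase)
    print(sorted(dels))                  # the five one-symbol deletions
    print(len(dels), len(ins), len(reps))
```

Even for one allowed error and a small alphabet, each variant would require its own perfect match search; allowing further errors compounds the count, since every variant is again subject to all three operations.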
The present invention has been made to solve the above problems of the prior arts, and has as its object to provide a fuzzy symbol string search method and apparatus capable of efficiently searching for a symbol string by using transposition information of a symbol string to be searched, and a recording medium recording a symbol string search program.
To achieve the above object, according to one aspect of the present invention, an approximate matching is executed by performing fuzzy collation with a target symbol string by using transposition information of a symbol string to be searched.
According to another aspect of the present invention, an approximate matching is efficiently performed by acquiring only pieces of transposition information of the same symbols as contained in a target symbol string from preformed and prestored pieces of transposition information of a symbol string to be searched, and scanning the acquired pieces of transposition information at least once or just once.
According to still another aspect of the present invention, an approximate matching is efficiently performed by acquiring pieces of transposition information of the same symbols as contained in a target symbol string and scanning the acquired pieces of transposition information just once. To this end, regardless of whether the acquired pieces of transposition information are continuous or discontinuous in an original symbol string to be searched, information indicating the matching state of a symbol string with respect to each of the acquired pieces of transposition information is efficiently calculated.
The present invention having the above aspects can rapidly execute the approximate matching for a very long symbol string to be searched.
The first reason is that not pieces of transposition information of a whole symbol string to be searched but only pieces of transposition information corresponding to symbols contained in a target symbol string are acquired for search. This reduces the data transfer time required for the processing compared to a method in which pieces of transposition information of a whole symbol string to be searched are acquired.
The second reason is that the approximate matching allowing mismatch of a predetermined number of times can be efficiently executed by scanning the acquired pieces of transposition information only once while a bit operation with a small calculation amount is repeated.
A great increase in search speed is achieved especially when the symbol string to be searched is so large that it is placed in a secondary storage of a computer, or when the number of types of symbols forming the symbol string to be searched is large and the number of types of symbols contained in the target symbol string is comparatively small. For example, although Japanese texts contain several thousands of different symbols, these texts are often searched for a word composed of a relatively small number of characters. The present invention is particularly effective in a case like this.
The above and many other objects, features and advantages of the present invention will become manifest to those skilled in the art upon making reference to the following detailed description and accompanying drawings in which preferred embodiments incorporating the principles of the present invention are shown by way of illustrative examples.