1. Field of the Invention
The present invention relates to a pattern search apparatus and method for finding a search pattern in a search target, and more particularly to a character string search apparatus and method for searching text for a character string to be found, that is, a keyword.
2. Description of the Related Art
As information has increasingly been put into electronic form in recent years, the demand for a pattern search apparatus that finds a pattern in such information has grown. In particular, as the volume of coded text increases, the character string search apparatus becomes more important.
Furthermore, user needs diversify as the volume of machine-readable information such as text increases. To satisfy these varying needs, there is a growing demand for higher search speed under whatever search conditions apply, such as the number of search patterns or keywords and the lengths of the search patterns or keywords.
For example, a conventional character string search apparatus employs one of the following character string search methods, chosen according to the intended use of the apparatus.
The first method is the Boyer-Moore method (abbreviated to the BM method hereinafter), which searches efficiently when there is only one keyword. The following document describes this method.
Boyer, R. S., Moore, J. S.: A Fast String Searching Algorithm, CACM, Vol. 20, No. 10, pp. 762 (1977)
Aho, A. V., Corasick, M. J.: Efficient String Matching: An Aid to Bibliographic Search, CACM, Vol. 18, No. 6, pp. 333 (1975)
Noriyoshi URAYA: FAST: A Fast Algorithm for Matching Multiple Patterns, IPSJ (Information Processing Society of Japan), Vol. 30, No. 9, pp. 1119 (1989)
A search by the BM method is explained below with a specific example. Consider the case where the keyword "AT-THAT" is searched for in an English text. With the BM method, the keyword is first aligned with the text so that the first character of the keyword coincides with the beginning of the text. A first comparison is then made between the last character of the keyword and the corresponding character of the text, "F" (indicated by an arrow below). This state is shown below. [Diagram STR1]
Since the comparison shows that the character "F" of the text does not occur anywhere in the keyword, the keyword is shifted to the right by 7 characters (the length of the keyword) without comparing the 6 characters preceding the last character "T" of the keyword with the corresponding characters of the text. This state is shown below. [Diagram STR2]
Next, the dash "-" located in the text at the position corresponding to the last character of the keyword is compared with the characters of the keyword. Since the dash occurs in the keyword as the third character from the beginning (the fifth character from the end), the keyword is shifted to the right by 4 characters so that the two dashes are aligned. This state is shown below. [Diagram STR3]
Since the last character of the keyword now matches the corresponding character "T" of the text, the preceding characters of the keyword and the text are compared. The character "L" preceding the "T" in the text does not occur in the keyword. Accordingly, the keyword is shifted to the right by 6 characters from the above described state, so that the end of the keyword corresponds to the location 7 characters to the right of the character "L". This state is shown below. [Diagram STR4]
Although the last two characters "AT" of the keyword match the characters at the corresponding locations in the text, a mismatch is found among the preceding characters of the keyword and the text. Therefore, the keyword is shifted to the right by 5 characters, and a similar comparison is made. This time all of the characters in the keyword match the corresponding characters in the text, so the keyword is found in the text. [Diagram STR5]
As described above, the amount by which the keyword is shifted to the right in the BM method is determined by the character of the text at which a mismatch is found and by the keyword itself. A table indicating the shift amounts may therefore be prepared from the keyword before the search begins.
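The walkthrough above can be reproduced with a short sketch. It is an illustration under stated assumptions, not the apparatus of any cited document: the text string is assumed to be the example sentence from the Boyer and Moore paper, which is consistent with the characters "F", "-", "L" and "AT" mentioned above but is not given in this description; the function names `good_suffix_shift` and `bm_search` are hypothetical; and the matched-suffix table is built by brute force for clarity rather than speed.

```python
def good_suffix_shift(pat, j):
    """Smallest shift that keeps the already matched suffix pat[j+1:] aligned
    with equal characters and puts a different character at position j."""
    m = len(pat)
    for s in range(1, m + 1):
        suffix_ok = all(k - s < 0 or pat[k - s] == pat[k] for k in range(j + 1, m))
        differs = j - s < 0 or pat[j - s] != pat[j]
        if suffix_ok and differs:
            return s
    return m


def bm_search(text, pat):
    """Return (index of first match or -1, list of shifts applied)."""
    m = len(pat)
    last = {c: i for i, c in enumerate(pat)}                 # rightmost index of each character
    good = [good_suffix_shift(pat, j) for j in range(m)]     # table prepared beforehand
    shifts = []
    pos = 0
    while pos + m <= len(text):
        j = m - 1
        while j >= 0 and text[pos + j] == pat[j]:            # compare from the right
            j -= 1
        if j < 0:
            return pos, shifts
        # Shift by the larger of the mismatched-character rule and the matched-suffix rule.
        shift = max(j - last.get(text[pos + j], -1), good[j])
        shifts.append(shift)
        pos += shift
    return -1, shifts


# Assumed text (not stated in this description): the Boyer-Moore paper's example.
TEXT = "WHICH-FINALLY-HALTS.--AT-THAT-POINT"
position, shifts = bm_search(TEXT, "AT-THAT")
print(position, shifts)  # 22 [7, 4, 6, 5]
```

Under these assumptions the recorded shifts are 7, 4, 6 and 5 characters, the same sequence as in the description above.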
The second method is the Aho-Corasick method (abbreviated to the AC method hereinafter), which is effective when there are a plurality of keywords. This method is described in the Aho and Corasick document cited above.
With the AC method, a state transition diagram covering all of the keywords is created before the search. The search then proceeds character by character from the beginning of the text, following the state transition diagram. A search result is output whenever a state at which a keyword has been recognized is reached.
FIG. 1 is a schematic diagram exemplifying a keyword search according to the AC method. FIG. 1A is a schematic diagram explaining the state transitions in the case where there are four keywords, namely "he", "she", "his", and "hers". The state "0" is the initial state, and the state shifts to "1" if an "h" is detected. If an "s" is detected in the initial state instead, the state shifts to "3". Further state transitions are made according to the state transition diagram. As shown in FIG. 1B, the result of the keyword detection is output when the state reaches the state "2", "5", "7", or "9".
FIG. 1C shows the destinations of a "failure" function, that is, the state transitions made when the expected character is not detected. For example, if the state is "1" (i=1) and neither an "e" nor an "i" is detected, the state shifts to "0" (f(1)=0). If an "e" is not detected in the state "4", the state shifts to "1" (f(4)=1) in order to determine whether or not an "i" is detected. After "she" is detected in the state "5", the state shifts to "2" (f(5)=2) in order to determine whether or not an "r" is detected (that is, whether or not "her" is detected).
With such a state transition, a search based on a plurality of keywords is performed according to the AC method.
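The state transitions and failure destinations described for FIG. 1 can be checked with a minimal sketch. It assumes that states are numbered in the order in which new states are created while the keywords "he", "she", "his", "hers" are entered one after another, which reproduces the state numbers quoted above. The function names `build_ac` and `ac_search` and the sample text "ushers" are hypothetical and do not appear in this description.

```python
from collections import deque


def build_ac(keywords):
    """Build goto, failure and output tables for a keyword set."""
    goto = [{}]       # goto[state][char] -> next state
    out = [set()]     # out[state] -> keywords recognized on reaching state
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(word)
    fail = [0] * len(goto)              # states directly below state 0 fail to state 0
    queue = deque(goto[0].values())
    while queue:                        # breadth-first over the states
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]      # inherit keywords that end at the failure state
    return goto, fail, out


def ac_search(text, keywords):
    """Scan text once and return (start index, keyword) for every hit."""
    goto, fail, out = build_ac(keywords)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for word in sorted(out[s]):
            hits.append((i - len(word) + 1, word))
    return hits


goto, fail, out = build_ac(["he", "she", "his", "hers"])
print(fail[1], fail[4], fail[5])                       # 0 1 2
print([s for s in range(len(out)) if out[s]])          # [2, 5, 7, 9]
print(ac_search("ushers", ["he", "she", "his", "hers"]))
# [(2, 'he'), (1, 'she'), (2, 'hers')]
```

The printed failure values agree with f(1)=0, f(4)=1 and f(5)=2, and the states with output are "2", "5", "7" and "9" as stated for FIG. 1B.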
The third search method is the FAST method, implemented by combining both the matching from the right of a keyword according to the BM method and the simultaneous matching of a plurality of keywords using a state transition diagram according to the AC method.
FIG. 2 is a schematic diagram exemplifying a keyword search implemented with the FAST method. In this figure, a text is searched using three keywords, namely "state", "east", and "smart". Unlike the case shown in FIG. 1A, in the state transition diagram according to the FAST method (not shown in the drawing) a state does not shift from left to right along a keyword, but from right to left.
After such a state transition diagram is created, a part of the text whose length equals that of the shortest keyword, starting from the beginning of the text, is regarded as the target, and the characters of this partial text, starting from the rightmost one, are compared with each of the three keywords, as shown in FIG. 2. The rightmost character "m" of the partial text does not match the rightmost character of any of the three keywords, and in this figure it occurs only in the keyword "smart", as the third character from the end. It follows that the comparison can be resumed after all three keywords are shifted to the right by three characters. The FAST method is described in the URAYA document cited above.
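The three-character shift in the FAST example can be checked with a small sketch. This is not the FAST method itself, whose state transition diagram is not reproduced here. It only builds a table of safe shifts for the character at the right end of the window, in the manner of a single-keyword shift table extended to a keyword set. The function name `multi_shift_table` is hypothetical.

```python
def multi_shift_table(keywords):
    """Return (window length, shift table) for a set of keywords.

    The window length is the length of the shortest keyword. For a character
    found at the right end of the window, the shift is the smallest distance
    from any non-final occurrence of that character to the end of its keyword,
    and never more than the window length.
    """
    min_len = min(len(word) for word in keywords)
    shift = {}
    for word in keywords:
        for i, ch in enumerate(word[:-1]):       # the final character is excluded
            dist = len(word) - 1 - i
            shift[ch] = min(shift.get(ch, min_len), dist)
    return min_len, shift


min_len, shift = multi_shift_table(["state", "east", "smart"])
print(min_len)                      # 4
print(shift["m"])                   # 3
print(shift.get("z", min_len))      # 4, a character that occurs in no keyword
```

Here "m" occurs only in "smart", three characters from the end, so the window moves by three characters. A character that occurs in no keyword allows a shift by the full window length, which corresponds to the seven-character shift in the single-keyword example.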
Conventionally, it was general practice to perform a character string search using any one of the above described methods. By way of example, a method such as the BM method is employed in an apparatus such as a word processor, in which a character string search is performed using only one keyword. A method such as the AC method, on the other hand, is employed in a system such as a large database system, in which a search is performed simultaneously for a plurality of keywords.
If search conditions such as the number of keywords are determined to some extent as described above, a relatively high search speed can be achieved by employing one particular search method in the character string search apparatus. However, if the number of keywords is indefinite, if the keywords are of various lengths, or if the type of text to be searched, for example the language in which the text is written, is not identified, the optimum search method cannot be selected to suit such search conditions.