The present invention relates generally to an information processing system. More particularly, the invention is concerned with a search term matching (or collating) technique in an information or character string retrieving (searching) system for deciding en bloc whether a set of plural character strings specified or designated for search or retrieval is present in a text composed of characters or character strings and subjected to retrieval.
In the field of information processing systems, one important application is to search or retrieve, from documents containing collections of character string data (hereinafter also referred to as the input text), all documents containing a specific character substring (also referred to as a to-be-searched or to-be-retrieved term, or simply as a search term) specified by a searcher or user.
There have already been proposed several schemes of character string retrieving or searching systems for realizing retrieval or searching of the type mentioned above. In this connection, reference may be made to L. A. Hollaar: "Hardware Systems for Text Information Retrieval", ACM SIGIR 6th Conference, (1983). For a better understanding of the background of the invention, a typical one of the character string retrieving systems proposed heretofore is shown in FIG. 2 of the accompanying drawings and will be described below in some detail.
Referring to FIG. 2, a character string retrieving system denoted generally by a numeral 1 includes a search or retrieval controller 101 which is in charge of overall control of the whole retrieval (search) system as well as communications or transactions with a host computer (not shown). More specifically, the retrieval controller 101 receives a retrieval (search) request 201 from the host computer and analyzes the request to thereby send retrieval information 202 (i.e. information for search or retrieval) to a character string matching or collating unit 200 (also referred to as the matcher) and a composite condition check unit 103. Further, the retrieval controller 101 controls a storage controller 104 for allowing the character string matching unit 200 to read out character string data 204 from a character string data storage 105.
The character string matching circuit 200 checks the input character string data 204 as to presence of a character string which coincides with a search term or character string designated by the retrieval request 201. When the corresponding search term is detected, information 205 for identifying the character string of interest is outputted to the composite condition check circuit 103. In response to the character string identifying information 205 as inputted, the composite condition check circuit 103 checks whether logical conditions including logical ANDing(s) and/or ORing(s) which are also designated by the retrieval request 201 are satisfied or not. When the designated composite conditions are found to be satisfied, the identification information of the relevant document as well as the content or text data of the document are sent back to the host computer as the retrieval or search result 206.
As a matching scheme for the matching or collation of character strings of concern in the character string matching unit 200 which constitutes an important part of the character string retrieving system 1, there is known a method of searching a plurality of character strings through a single text scan by resorting to the use of a finite automaton. As a typical one of such methods, there may be mentioned a technique proposed by A. V. Aho et al. (Reference may be made to A. V. Aho and M. J. Corasick: "Efficient String Matching", CACM, Vol. 18, No. 6, 1975). A hardware system designed for high-speed execution of an automaton based on the proposal of Aho et al. is disclosed in J-P-A-60-105039. A character string matching circuit described in J-P-A-60-105039 will be described below by reference to a block diagram of FIG. 3 of the accompanying drawings.
The character string matching circuit of the prior art shown in this figure is composed of a character code register 211, a state transition table 220, a state ID (identification) number register 250 and a matching ID table 260.
For a better understanding of the teachings of the present invention, the character string matching operation performed by this prior art character string matching circuit will be elucidated below.
At first, through an initialization processing, there is generated an automaton for performing matching or collation of search terms (i.e. character strings specified for retrieval) in the state transition table 220. Subsequently, the state number 0 representing the initialized state of the automaton is placed in the state number register 250. The state number placed in the register 250 is referred to as the current state number and denoted by a reference numeral 305.
The matching operation starts from loading of a character in the character code register 211 from an input character string (input text) 204 on a character-by-character basis. Next, access is made to the state transition table 220 by using as the addresses therefor a character code 302 outputted from the character code register 211 and the current state number 305 outputted from the state number register 250, respectively, as a result of which a succeeding state number 303 indicating the state to which state transition is next to be made is read out from the state transition table 220. This succeeding state number 303 is held as a renewed or updated current state number as indicated at 305. In parallel with the access operation to the state transition table 220, the matching ID (identification) table 260 is accessed by using as the addresses therefor the current state number 305 outputted from the state number register 250 and the input character code 302, whereby the identification number or identifier of the search term is read out as a result of the matching as indicated at 205. In this connection, it is noted that when the identification number or identifier for the search term as read out from the matching ID table 260 is "0" (zero), this means that no match with any search term has been made.
By repeating a series of operations mentioned above, the character string matching can be accomplished.
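The loop described above can be sketched in software as follows. This is only an illustrative simulation, not the circuit of J-P-A-60-105039 itself: the function name run_matcher, the list-of-lists table layout, the use of ord() for 8-bit character codes, and the small automaton for the single hypothetical term "AB" are all assumptions introduced for the example. The point it illustrates is that both tables are addressed with the same pair, namely the current state number held before the update and the input character code.

```python
def run_matcher(text, transition_table, match_id_table):
    """Simulate the matching loop; both tables are indexed [state][char code]."""
    state = 0  # state number register 250 starts at the initial state 0
    hits = []
    for position, ch in enumerate(text):
        code = ord(ch)  # stands in for the character code register 211
        # Both look-ups use the state held BEFORE the update (done in
        # parallel in hardware, sequentially here).
        match_id = match_id_table[state][code]
        next_state = transition_table[state][code]
        if match_id != 0:  # 0 means that no search term was matched
            hits.append((position, match_id))
        state = next_state  # succeeding state becomes the current state
    return hits


# Hypothetical two-state automaton for the single term "AB" (identifier 1).
NUM_CODES = 256  # assumed 8-bit character codes
transition = [[0] * NUM_CODES for _ in range(2)]
match_ids = [[0] * NUM_CODES for _ in range(2)]
transition[0][ord("A")] = 1
transition[1][ord("A")] = 1  # stay in state 1 so that "AAB" still matches
match_ids[1][ord("B")] = 1   # trailing character "B" seen in state 1

hits = run_matcher("XABAAB", transition, match_ids)
print(hits)
```

Each hit is reported as a pair of the text position of the trailing character and the matching identifier.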
To describe this character string matching operation more concretely, reference is made to FIG. 4 of the accompanying drawings which shows a state transition diagram of an automaton.
In FIG. 4, there is shown a state transition diagram of an automaton for matching two search terms, i.e. "DOG" and "CAT", inputted by a searcher. In the figure, circles represent the states of the automaton, respectively, while arrows represent state transitions, respectively. Further, characters affixed to the arrows represent input characters which bring about the associated state transitions, respectively. Additionally, numerals shown as enclosed by the circles indicate the state numbers, respectively. The state 0 (zero) is the initial state of the automaton under consideration. For all the input characters for which no specific entries are found in conjunction with the transition, the automaton always assumes the initial state 0. The arrows 404 and 405 each affixed with a slash mark "/" represent the state transitions indicating that matching has been accomplished for "DOG" and "CAT", respectively. More specifically, the arrow 404 representing the state transition from the state 2 indicates that the matching has been accomplished for the term "DOG". On the other hand, the arrow 405 representing the transition from the state 4 indicates that "CAT" has been matched or collated.
Now, description will be made of the character string matching operation of the circuit disclosed in J-P-A-60-105039 by reference to FIG. 4. The state transition of the automaton shown in this figure starts from the initial state 0 (zero). When the input character in the initial state 0 is "D", transition occurs to the state 1 while the input character "C" in the initial state 0 (zero) brings about the state transition to the state 3. When the input character is neither "D" nor "C", the automaton remains in the initial state 0 (zero). Similarly, in the state numbered 1, the input character of "O" brings about the state transition to the state 2, while for the input character of "C", the automaton transits to the state 3 with the input character of "D" bringing about the state transition to the state 1. For any input characters other than those mentioned above in the state 1, the automaton always resumes the initial state 0 (zero). When the input character is "G" in the state 2, transition represented by the arrow 404 takes place. This means that the result of the matching has been obtained for the search term "DOG". The above applies equally to the other state transitions as well.
FIGS. 5 and 6 of the accompanying drawings show exemplary structures of the state transition table 220 storing the automaton shown in FIG. 4 and the matching ID table 260, respectively.
At this juncture, it should be mentioned that JIS code is used as the character code. The state transition table 220 is implemented in such a structure which allows access thereto with the input character code 302 and the current state number 305 of the automaton under consideration. More specifically, when the current state number 305 has a value 0 (zero) with the input character code 302 being "D", the state number 1 (one) corresponding to 0 (zero) and "D" is outputted as the succeeding state number 303 to which the automaton should make transition in succession.
The matching ID table 260 stores the information that matching of the search terms has been accomplished as well as the information resulting from the matching as indicated by the arrows 404 and 405 in FIG. 4. In other words, the identification numbers of the search terms (hereinafter also referred to as the matching identifiers) are stored in the matching ID table 260 which can be addressed with the current state number and the trailing character code of the search term upon occurrence of the state transition in response to appearance of the trailing character thereof (e.g. the table 260 is addressed with the state 2 and the character "G" in the case of the search term "DOG" illustrated in FIG. 4). In the case of the illustrated example concerning the retrieval of "DOG", 1 (one) is stored as the matching identifier. Numerical values other than 0 (zero) represent the identification numbers or identifiers of the search term. By assigning the numerical values other than 0 (zero) to the matching identifiers while assigning 0 (zero) to those other than the objectives for matching, it is possible to discriminatively identify the output of the matching or collation as performed.
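As a rough illustration of the two tables for the "DOG"/"CAT" automaton, the sketch below builds them as dense arrays addressed by state number and character code, then scans a short text. Several details are assumptions rather than facts taken from FIGS. 5 and 6: the 8-bit ord() codes used in place of JIS codes, the identifier 2 for "CAT" (the text only states that 1 is stored for "DOG"), the return to state 0 after a trailing character, and the helper names build_tables and scan.

```python
NUM_CODES = 256  # assumed 8-bit character codes
NUM_STATES = 5   # states 0..4 as numbered in the description of FIG. 4


def build_tables():
    """Build dense [state][char code] tables for the terms "DOG" and "CAT"."""
    transition = [[0] * NUM_CODES for _ in range(NUM_STATES)]  # default: state 0
    match_ids = [[0] * NUM_CODES for _ in range(NUM_STATES)]   # default: no match
    for state in range(NUM_STATES):
        transition[state][ord("D")] = 1  # "D" may start "DOG" from any state
        transition[state][ord("C")] = 3  # "C" may start "CAT" from any state
    transition[1][ord("O")] = 2  # "DO"
    transition[3][ord("A")] = 4  # "CA"
    # Trailing characters: the next state is assumed to be 0 (the default).
    match_ids[2][ord("G")] = 1   # "DOG" matched, identifier 1
    match_ids[4][ord("T")] = 2   # "CAT" matched, identifier 2 (assumed value)
    return transition, match_ids


def scan(text, transition, match_ids):
    """Return (position, identifier) for every match found in one pass."""
    state, hits = 0, []
    for position, ch in enumerate(text):
        code = ord(ch)
        if match_ids[state][code] != 0:
            hits.append((position, match_ids[state][code]))
        state = transition[state][code]
    return hits


transition, match_ids = build_tables()
hits = scan("DOCAT DDOG", transition, match_ids)
used_slots = sum(1 for row in match_ids for value in row if value != 0)
print(hits, used_slots)
```

In this sketch only two entries of the matching ID table are non-zero, one per search term; every other slot holds 0.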
It will now be understood in what manner the retrieval processing is carried out in the case of the prior art character string retrieving system.
As a problem of the prior art system, it has to be first pointed out that the areas of the matching ID table 260 which are effectively made use of are only those which correspond to the transitions occurring upon appearance of the trailing characters of the search terms, as can be seen from FIG. 6. Put another way, the matching ID table has undesirably to be implemented with a capacity which corresponds to a product of the number of different types of characters and the number of the states in order to store only a small amount of search term identification information. That is, a memory of a large capacity is required for the matching ID table and is used inefficiently. More concretely, in the case of the automaton illustrated in FIG. 4, there are required as many as 1024 slots for storing the search term identifiers in a number corresponding to 256 character species or types multiplied by the four states even though only two slots are sufficient for the identification of the two search terms in reality.
As another example, let it be assumed that matching is performed on sixty-four search terms each composed of four characters. Since the number of the states of the automaton required for the matching of one search term is three, matching for the sixty-four search terms requires 192 states in total (= 3 states × 64 search terms). Accordingly, the matching ID table has to be of a sufficiently large capacity for accommodating as many as 49,152 slots (= 256 character species × 192 states). However, the number of the slots which can effectively be used amounts actually to no more than 64 slots, i.e. only about one thousandth of all the slots.
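The arithmetic of this example can be checked with a few lines of code. The constant names are chosen only for illustration; the figures themselves (256 character species, 64 terms, three states per four-character term) are those quoted in the text.

```python
from fractions import Fraction

CHAR_SPECIES = 256   # number of distinct character codes
NUM_TERMS = 64       # search terms of four characters each
STATES_PER_TERM = 3  # states per term, as stated in the text

num_states = STATES_PER_TERM * NUM_TERMS   # states of the whole automaton
total_slots = CHAR_SPECIES * num_states    # capacity of the matching ID table
used_slots = NUM_TERMS                     # one slot per trailing character
utilization = Fraction(used_slots, total_slots)

print(num_states, total_slots, used_slots, utilization)
```

The exact utilization works out to 1/768, i.e. roughly 0.13 percent, which is the "about one thousandth" referred to above.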
As will be understood from the above, the matching ID table of the character string retrieving system disclosed in the J-P-A-60-105039 makes it necessary to use a memory of an enormously large capacity even for the matching of only a small number of the search terms, which ultimately leads to an expensive and large scale character string retrieving system, presenting a serious problem.
Another problem of the prior art character string retrieving system can be seen in that since one slot of the matching ID table can store no more than one matching identifier of the search term, it is impossible to cope with a multiple matching processing.
This problem will be discussed below in conjunction with an automaton shown in FIG. 7 of the accompanying drawings.
The automaton shown in this figure is so configured as to perform matching on two retrieval or search terms "DOG" and "HOTDOG". In this connection, FIGS. 8 and 9 of the accompanying drawings show, respectively, the contents of the state transition table 220 and the matching ID table 260 which correspond to the automaton shown in FIG. 7.
Apparently, the term "DOG" is a substring constituting a so-called trailing character substring of "HOTDOG". Accordingly, when an input text containing "HOTDOG" is inputted, not only "HOTDOG" but also "DOG" will have to be outputted as the result of the matching. In other words, upon inputting of the trailing character "G" of "HOTDOG", the identification numbers for the two terms "HOTDOG" and "DOG" have to be outputted. In this connection, the scheme in which a plurality of terms are matched through the state transition process for a single input text, as mentioned above, is referred to as the multiple matching. Further, the terms susceptible or subjected to such multiple matching may be said to have a "multiple matching relation". Now, turning back to the prior art character string retrieving system, it will be appreciated that the slot of the matching ID table 260 corresponding to the current state number 7 and the character code "G" can accommodate no more than one matching identifier, i.e. only the result of matching for "HOTDOG". Thus, it is impossible to output the result of matching for the term "DOG". In other words, the prior art character string retrieving system suffers from an additional problem that the aforementioned multiple matching processing cannot be carried out, to a serious disadvantage.
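The limitation can be reproduced with a small simulation of a one-identifier-per-slot matching ID table. The sketch is built on assumptions: apart from state 7, which the text names, the state numbering is a guess at FIG. 7; the identifier 2 for "HOTDOG", the 8-bit ord() codes, and the helper names build_tables, scan and reference_hits are hypothetical. The reference function simply lists every term that ends at each text position, which is what multiple matching would have to report.

```python
NUM_CODES = 256  # assumed 8-bit character codes
NUM_STATES = 8   # states 0..7; only state 7 is fixed by the text

TERMS = {1: "DOG", 2: "HOTDOG"}  # identifier 2 is an assumed value


def build_tables():
    """Build [state][char code] tables with ONE identifier per slot."""
    transition = [[0] * NUM_CODES for _ in range(NUM_STATES)]
    match_ids = [[0] * NUM_CODES for _ in range(NUM_STATES)]
    for state in range(NUM_STATES):
        transition[state][ord("D")] = 1  # "D" may start "DOG"
        transition[state][ord("H")] = 3  # "H" may start "HOTDOG"
    transition[1][ord("O")] = 2  # "DO"
    transition[3][ord("O")] = 4  # "HO"
    transition[4][ord("T")] = 5  # "HOT"
    transition[5][ord("D")] = 6  # "HOTD" (overrides the default "D" -> 1)
    transition[6][ord("O")] = 7  # "HOTDO"
    match_ids[2][ord("G")] = 1   # "DOG" completed from state 2
    match_ids[7][ord("G")] = 2   # this slot can hold "HOTDOG" only
    return transition, match_ids


def scan(text, transition, match_ids):
    """Single-pass scan returning (position, identifier) pairs."""
    state, hits = 0, []
    for position, ch in enumerate(text):
        code = ord(ch)
        if match_ids[state][code] != 0:
            hits.append((position, match_ids[state][code]))
        state = transition[state][code]
    return hits


def reference_hits(text, terms):
    """Every term ending at each position (the multiple matching result)."""
    return [(i, tid) for i in range(len(text))
            for tid, term in terms.items() if text[: i + 1].endswith(term)]


transition, match_ids = build_tables()
reported = scan("HOTDOG", transition, match_ids)
wanted = reference_hits("HOTDOG", TERMS)
missed = [hit for hit in wanted if hit not in reported]
print(reported, wanted, missed)
```

With the input "HOTDOG", the single-slot table reports only the identifier of "HOTDOG" at the final "G", so the match for "DOG" at the same position appears in the missed list.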