The present invention relates to an information processing system, and in particular, to a retrieval or search method in an information search system for judging at a time whether or not a set of a plurality of character streams exists in character streams undergoing a search operation. The present invention is applicable to a search operation in a data base, a document filing system, a word processor, and the like.
In recent years, the importance of large-scale data base services that include not only secondary information (bibliographic information) but also primary information (text) of, for example, literature information and patent information has increased. For the information retrieval or search processing in such a data base, there has been conventionally adopted a method utilizing a key word and a classification code. According to this method, however, the range of the search operation can only be narrowed down to several tens or several hundreds of items; consequently, there remains a problem with respect to the processing efficiency in that the operator of the search operation is required to directly read the text at the final stage to confirm the content. In addition, since the classified items also change with the lapse of time, there arises a problem that key words and classification codes must be appropriately updated. Furthermore, when registering a new document, because the operation of assigning the key words thereto (called an indexing operation) takes a considerable period of time, batch processing is effected for the registration of a considerable volume of data. As a result, there arises a problem of delay associated with the information search operation in that a certain period of time is required before the information of the document can be searched in the system.
As a method to cope with these problems, there has been considered an entire or overall text search system in which the operator can search the content of the text of a document by directly referencing the text based on an arbitrary key word or a free key word.
Particularly, in a case where a search operation is carried out by use of an arbitrary key word (free key word operation), terms with different notations and synonymous terms are employed in many cases depending on the personal habit of the operator. For example, the codes of a long vowel (--) and (--) are used as -- (DATA) and -- (DATA), the code of the long vowel is present or absent in a term as " " (INTERFACE) and " " (INTERFACE), and a word with a difficult pronunciation is expressed as " " (INTERFACE) and " " (INTERFACE).
In addition, as an example of synonymous words for a computer, there are used " " (KEISANKI), " " (DENSANKI), and " " (COMPUTER). For a single term, words including those represented in different notations and synonymous words may have to be considered in some cases. In consequence, when a key word is specified with a plurality of free words, several hundreds of words including those in the different notations and the synonyms are to be generated.
As described above, in order to effect a search operation by use of a free word, there is required means for achieving at a high speed a search operation for many key words including synonyms and words expressed in the different notations.
On the other hand, there have been proposed several character stream searching apparatuses suitable for such an overall or entire text search system. With reference to FIG. 1 showing a representative constitution of the apparatus, a description will be given of the content thereof.
In a character stream search apparatus 1, a search controller 101 controls the overall search apparatus and achieves communications with a host computer. That is, the search controller 101 receives a search request 201 sent from the host computer, analyzes the search request 201, and then delivers the analyzed request 201 as search information 202 to a term comparator 200 and a query resolver 103. In addition, the search controller 101 controls a disk controller 104 so as to transmit character stream data 204 stored in a search data base 105 to the term comparator 200.
The term comparator 200 checks whether or not the input character stream data 204 contains a character stream which matches the search request. If this is the case, the term comparator 200 outputs information 205 identifying the character stream to the query resolver 103, which in turn checks, based on the pertinent character stream identification information 205, whether or not a complex condition including a positional interrelationship specified in the search request is satisfied. If the complex condition is satisfied, the query resolver 103 supplies the host computer with a search result 206 including pointer information of the pertinent document and text data of the content of the document.
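The division of labor between the term comparator and the query resolver can be illustrated with a minimal sketch. It assumes that the identification information 205 is a list of (position, term) pairs and that the complex condition is a simple proximity test; the function names, the `max_gap` parameter, and the naive scanning loop are hypothetical and are not taken from the apparatus of FIG. 1.

```python
def term_comparator(text, terms):
    """Return sorted (position, term) pairs for every occurrence of each term.

    A naive stand-in for the term comparator: it rescans the text once per term.
    """
    hits = []
    for term in terms:
        start = text.find(term)
        while start != -1:
            hits.append((start, term))
            start = text.find(term, start + 1)
    return sorted(hits)


def query_resolver(hits, first, second, max_gap):
    """Assumed positional condition: `second` starts at most `max_gap`
    characters after the start of some occurrence of `first`."""
    firsts = [pos for pos, term in hits if term == first]
    seconds = [pos for pos, term in hits if term == second]
    return any(0 < q - p <= max_gap for p in firsts for q in seconds)


TEXT = "full text search of a data base"
HITS = term_comparator(TEXT, ["text", "data"])
```

With this sample text, `query_resolver(HITS, "text", "data", 20)` is satisfied while a `max_gap` of 10 is not, because the two terms start 17 characters apart.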
As a collation method of collating character terms in the term comparator 200 as a primary component of the character stream search apparatus 1, there has been known a method in which a plurality of character terms are searched through one scanning operation by use of a finite state automaton (FSA). As representative methods, there are two methods as described by A. V. Aho and M. J. Corasick in "Efficient String Matching: An Aid to Bibliographic Search", CACM, Vol. 18, No. 6, 1975.
First, referring to FIG. 2, description will be given of the first method (to be referred to as method 1 herebelow). FIG. 2 shows a state transition diagram of an FSA in a case where three character streams including "ABX", "CABY", and "DCABZ" are searched from character stream data. In this diagram, a circle denotes a state of the FSA and an arrow indicates a transition of the FSA state. An alphabet letter assigned to each arrow designates an input character for which the associated state transition takes place. An arrow 403 designates a transition to the initial state. A numeric value marked in each circle stands for a state number of the pertinent state. The characteristic feature of the method 1 resides in that all possible state transitions are represented in the FSA. Consequently, there also exist transitions from the respective states to the state 0; however, in order to avoid the complexity of the diagram, the arrows associated with the transitions to the state 0 from the states other than the state 0 are omitted in this diagram.
Next, the collating operation of the method 1 will be described. The initial state of the FSA is the state 0, namely, the state transition starts from the state 0. For the input characters "A", "C", and "D", the state transition takes place from the state 0 to state 1, state 4, and state 8, respectively. In the state 0, in a case where a character other than "A", "C", and "D" is inputted thereto, the state is returned to the state 0. Similarly, in the state 1, if the input character is "B", "C", "D", or "A", the state is changed to the state 2, state 4, state 8, or state 1, respectively. The state 1 is returned to the state 0 if a character other than "B", "C", "D", and "A" is inputted. In the state 2, if the input character is "X", the state is changed to the state 3 which is the end point of the FSA, thereby attaining a character stream "ABX" as the result of the search. For other states, the state transitions are effected in the similar fashion.
As described above, according to the method 1, the state transitions associated with all possible input characters are represented in the FSA. In consequence, the number of state transitions of the FSA is increased and hence there arises a problem that a considerably long period of time is required to generate the FSA. A hardware system in which this method is implemented has been described in the JP-A-60-105039 and JP-A-60-105040.
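The behavior of the method 1 can be sketched as follows for the three example character streams. This is a minimal illustration, not the construction used in the cited hardware: the function names are hypothetical, states are assumed to be numbered in order of creation while the key words are inserted in the order given, and failure links are used only as a construction aid to fill in every transition in advance.

```python
from collections import deque


def build_trie(keywords):
    """Build the key word tree; states are numbered in order of creation."""
    goto = [{}]        # goto[s][ch] -> next state
    output = [set()]   # key words recognized on entering state s
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(word)
    return goto, output


def build_full_dfa(keywords):
    """Return (delta, output) where delta[s][ch] is defined for every state
    and every character that occurs in some key word."""
    goto, output = build_trie(keywords)
    alphabet = sorted({ch for w in keywords for ch in w})
    delta = [{} for _ in goto]
    fail = [0] * len(goto)
    for ch in alphabet:
        delta[0][ch] = goto[0].get(ch, 0)
    queue = deque(goto[0].values())
    while queue:                       # breadth-first, shallow states first
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = delta[fail[s]][ch]
                output[t] |= output[fail[t]]
                delta[s][ch] = t
                queue.append(t)
            else:
                # Precompute the transition instead of failing at scan time.
                delta[s][ch] = delta[fail[s]][ch]
    return delta, output


def scan(delta, output, text):
    """One table lookup per input character; returns (start, key word) pairs."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        s = delta[s].get(ch, 0)        # characters in no key word lead to 0
        for w in sorted(output[s]):
            hits.append((i - len(w) + 1, w))
    return hits


KEYWORDS = ["ABX", "CABY", "DCABZ"]
DELTA, OUTPUT = build_full_dfa(KEYWORDS)
```

Under these assumptions, `DELTA[1]` maps "B", "C", "D", and "A" to the states 2, 4, 8, and 1, as in the description above, and the table holds one entry per state per key word character, which is what makes the FSA costly to generate.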
Next, referring to FIGS. 3 and 4, description will be given of the second method (to be referred to as method 2 herebelow). Like the FSA of FIG. 2, the FSA of FIG. 3 is used to search three character streams including "ABX", "CABY", and "DCABZ" from data comprising character streams. FIG. 4 is an explanatory diagram useful to explain a failure function table indicating transition destinations to be employed in cases where characters other than those included in this FSA are inputted.
Operations will now be described according to the method 2. The initial state of the FSA is state 0. In the case of this example, for the input characters "A", "C", and "D", there occur state transitions to the state 1, state 4, and state 8, respectively. When a character other than these letters is inputted, the state is returned to the state 0. On the other hand, for the state 1, if "B" is received as an input character, the state is changed to the state 2. In this situation, when a character other than "B", namely, a character not described in the FSA, for example, "D" is inputted, this condition is called a failure in this method and hence the failure function table of FIG. 4 is referenced. In the failure function table, there are stored state numbers of failure destinations where a retry is to be effected for the current state number. In this case, a value 0 associated with the failure destination, namely, the state entered upon failure, is obtained depending on the current state number, and consequently, the state is changed to the state 0. Thereafter, a retry is effected for the input character "D" so as to change the state to the state 8. This provision is called a failure function.
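The failure function can be sketched as follows for the same three character streams. This is a minimal illustration under stated assumptions: the function names are hypothetical, states are assumed to be numbered in order of creation, and the table is computed by the usual breadth-first procedure rather than copied from FIG. 4. Reporting of matched character streams is omitted.

```python
from collections import deque


def build_goto(keywords):
    """Key word tree holding only the transitions spelled out by the key words."""
    goto = [{}]
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
    return goto


def build_failure(goto):
    """fail[s] is the state at which a retry is effected when s has no
    transition for the current input character."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())    # states at depth 1 fail to state 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)
    return fail


def step(goto, fail, state, ch):
    """Consume one character; return (next state, number of failures used)."""
    failures = 0
    while state and ch not in goto[state]:
        state = fail[state]            # failure: move to the retry state
        failures += 1
    return goto[state].get(ch, 0), failures


GOTO = build_goto(["ABX", "CABY", "DCABZ"])
FAIL = build_failure(GOTO)
```

For the case described above, `step(GOTO, FAIL, 1, "D")` fails once from the state 1 to the state 0 and then retries "D" to reach the state 8.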
In the method 2, by adopting the failure function, as shown in FIG. 3, the number of transitions is greatly reduced as compared with the method 1 (FIG. 2). However, the method 2 is attended with the following problem. For example, let us consider a case where a character stream "DCABX" is inputted. In this situation, the state of the FSA undergoes a transition as 0→8→9→10→11→6→2→3 so as to obtain a character stream "ABX" as a search result. In this operation, the two transitions from the state 11 to the state 6 and from the state 6 to the state 2 are caused by the failure function, that is, the failure function is employed two times for the collate processing of the single input character "X". In consequence, according to the method 2 above, the state transition diagram of the FSA is simplified and the period of time to generate the FSA is advantageously reduced; however, the method 2 is attended with a disadvantage that the processing speed is lowered when a failure takes place. In addition, the failure may occur a plurality of times for a single input character. Consequently, the number of characters processed in a unit of time varies with the number of failure occurrences. As a result, for character streams inputted for search operations at a predetermined constant interval of time, there are required buffers, a synchronization control mechanism, and the like to match the processing speed to the input character streams, which leads to a problem that the control of the apparatus becomes complicated.
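The variable amount of work per input character can be made visible with a short trace. The sketch below is illustrative only: `build_machine` and `trace` are hypothetical names, states are assumed to be numbered in order of creation, and matched character streams are not reported.

```python
from collections import deque


def build_machine(keywords):
    """Return (goto, fail) for the key words, with states numbered in
    order of creation."""
    goto = [{}]
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)
    return goto, fail


def trace(goto, fail, text):
    """Return (states visited, failures used for each input character)."""
    state, path, failures = 0, [0], []
    for ch in text:
        n = 0
        while state and ch not in goto[state]:
            state = fail[state]        # failure transition, no input consumed
            path.append(state)
            n += 1
        state = goto[state].get(ch, 0)
        path.append(state)
        failures.append(n)
    return path, failures


GOTO, FAIL = build_machine(["ABX", "CABY", "DCABZ"])
PATH, FAILURES = trace(GOTO, FAIL, "DCABX")
```

For the input "DCABX", `PATH` is 0, 8, 9, 10, 11, 6, 2, 3 and `FAILURES` is 0, 0, 0, 0, 2: the first four characters each cost one transition, whereas "X" costs three, which is the source of the uneven processing rate noted above.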