The present invention relates to an information processing system, and more particularly to a search term matching method and apparatus in a text search system. It is concerned with a search which is effected by deciding en bloc whether or not a plurality of character substrings designated as a search term exist in a text character string.
The invention can be utilized for the search in a database, a document filing system and a word processor or the like.
In the field of information processing systems, one of the important types of processing is to search a set of documents, each constituted by character string data (hereinafter referred to as the text), for all documents which contain a particular word, i.e., a character string desired by a searcher (hereinafter referred to as the search term).
Several text search apparatuses have been proposed for realizing such search systems. Of these, the structure of a typical character string retrieval system is shown in FIG. 1 and will be described below (refer to L. A. Hollaar: "Text Retrieval Computers", COMPUTER, March 1979).
In the character string retrieval system 1, a retrieval control means 101 is in charge of overall control of the retrieval system and of communication with a host computer. More specifically, the retrieval control means 101 accepts a search query 201 issued from the host computer, analyzes it and sends the results thereof as retrieval control information 202 to a character string matching means 102 and a compound condition decision means 103. Further, the retrieval control means 101 controls a storage unit control means 104 for reading out character string data 204 from a character string storage means 105 into the character string matching means 102.
The character string matching means 102 checks the input character string data 204 as to whether or not a character string satisfying the search query 201, i.e., the search term, exists in the input character string data and, if it exists, outputs information 205 identifying the character string to the compound condition decision means 103. On the basis of the character string identifying information 205, the compound condition decision means 103 verifies whether or not a logical condition such as an "AND" or "OR" condition specified in the search query 201 is satisfied. When the designated compound condition is met, the identification information of the relevant document and the text data representing the contents of the document are sent back to the host computer as a retrieval result 206.
As a character string matching scheme adopted in the character string matching means 102 which is vital to the character string retrieval system 1, there is known a method of searching a plurality of character strings through a single text scanning by using a finite automaton. As hardware for executing the finite automaton at a high speed, there may be mentioned those described in Japanese Unexamined Patent Application Publications Nos. 105040/1985 and 95672/1991.
However, an attempt to realize a high-speed character string matching means by employing these known techniques encounters problems, which will be discussed below.
The character string matching means proposed in Japanese Unexamined Patent Application Publication No. 105040/1985 is shown in FIG. 2. This character string matching means is designed for performing the matching for a text in which character codes such as Chinese characters each constituted by two bytes are used. The matching operation executed by this character string matching means 102 will briefly be elucidated below.
The character string matching means 102 is composed of a data register 20, a change-over circuit 21, an address register 7, an address decoder 9, a random access memory 8, a memory register 10 and a control circuit 22.
As the initialization, an initial state number "0" (zero) is set at the more significant byte of the address register 7. Further, a state transition table is set in the random access memory 8. For this reason, the random access memory 8 may herein be referred to as the state transition table.
The matching operation is started by fetching two bytes, i.e., the character codes corresponding to one character, into the data register 20 from the input text 204. The two bytes as fetched are divided into single-byte codes by the change-over circuit 21 (hereinafter also referred to as the selector 21), in the order of the more significant byte and then the less significant byte, and each is stored in turn in the address register 7 at the less significant byte position thereof.
Assuming now that the more significant byte is selected by the selector 21 and stored in the address register 7 at the less significant byte position thereof, the value stored in the address register 7, inclusive of the more significant byte, is sent to the state transition table 8 as the address for reference via the address decoder 9. From the state transition table 8, the (ID) number of the destination state of the transition is read out in correspondence to the above-mentioned address for reference and held in the memory register 10. This state number is outputted to the control circuit 22 in order to decide whether or not the state contains a result of the matching. When it is decided that a result of the matching is stored, a matching result (ID) number is outputted as a matching result 205.
Subsequently, the state number is stored in the address register 7 at the more significant byte. Next, the less significant byte of the data register 20 is selected by the selector 21 to be stored in the address register at the less significant byte position thereof. In succession, operation similar to that described above is repeated to carry out the character string matching.
As will be appreciated from the above, this character string matching means 102 is designed to perform the matching processing by decomposing each two-byte character code into two discrete bytes and passing them through the automaton one after the other. Thus, the state transition table 8 is referenced twice for a single character. As a consequence, although the memory capacity for the state transition table 8 can be reduced considerably, the matching of a single character takes about twice the access cycle of the memory constituting the state transition table 8. Accordingly, when a text composed of two-byte character codes is searched, the matching throughput of the character string matching means 102 is reduced to about a half of that obtained in searching a text composed of one-byte character codes, which is a problem.
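The byte-serial referencing described above can be illustrated by the following minimal sketch. It is not the circuit of FIG. 2 itself: the function names, the two-byte codes and the table layout (state number in the upper part of the address, one byte in the lower part) are assumptions made for illustration, and missing table entries are simply treated as a return to the initial state "0", whereas a real state transition table would hold complete transitions.

```python
def build_byte_table(terms):
    """Return (table, hits) for a byte-serial matcher (illustrative only).

    table maps the address (state << 8) | byte to the next state number;
    hits maps a terminating state number to a search term ID.
    """
    table, hits = {}, {}
    next_free = 1
    for term_id, term in enumerate(terms, start=1):
        state = 0
        for byte in term:
            address = (state << 8) | byte
            if address not in table:
                table[address] = next_free
                next_free += 1
            state = table[address]
        hits[state] = term_id
    return table, hits


def scan_two_byte_text(text, table, hits):
    """Scan two-byte character codes with one table reference per byte."""
    state, lookups, matches = 0, 0, []
    for char_index in range(len(text) // 2):
        high, low = text[2 * char_index], text[2 * char_index + 1]
        for byte in (high, low):  # more significant byte first
            # Simplification: a missing entry means "back to state 0".
            state = table.get((state << 8) | byte, 0)
            lookups += 1
            if state in hits:
                matches.append((char_index, hits[state]))
    return matches, lookups


# Hypothetical two-byte codes: the term is two characters, the text four.
TERM = bytes([0x30, 0x21, 0x30, 0x22])
TEXT = bytes([0x41, 0x42, 0x30, 0x21, 0x30, 0x22, 0x43, 0x44])
table, hits = build_byte_table([TERM])
matches, lookups = scan_two_byte_text(TEXT, table, hits)
print(matches, lookups)
```

In this sketch the four-character text causes eight table references, i.e., two per character, which is the factor of two discussed above.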
Next, the problems from which the system described in Japanese Unexamined Patent Application Publication No. 95672/1991 suffers will be analyzed. In this system, the state of the automaton in which an input character is to be matched is indicated by placing a mark termed a token. More specifically, every time one character is inputted from the input text, reference is made to each state on which a token is placed. In addition, a token is always generated in the initial state every time an input character code is fetched. The matching operation is carried out by referencing a state transition table, using as the address therefor the ID number of the state on which the token is placed together with the input character code. Consequently, when a plurality of tokens exist in the automaton concerned, the state transition table is referenced a corresponding number of times for one inputted character. As a result, the matching throughput is lowered to a fraction, namely to one over the number of tokens.
Concerning the matching operation in the case where a plurality of tokens are involved in the matching processing for one character, description will be made below by referring to an automaton shown in FIG. 3. This automaton is designed for matching en bloc for " , , " as well as spelling variants thereof " , , 6 ", " 7 , , ", " , , ", " , , ", " , , ", " , , ", " , , " and " , , ".
Upon inputting of " , , " as the input text, the token makes transition in a manner illustrated in FIG. 4. At first, when " " is inputted, a token 1 is newly generated from the state "0" which is the initial state. Since the transition due to " " is described in the state "0" (refer to FIG. 3), the matching is validated, whereby the token 1 is caused to make transition to the state "1".
Upon the succeeding input of " ", a token 2 is newly generated in the state "0". However, since no transition due to " " is described for this state, the matching is invalidated and the token 2 disappears. Meanwhile, the token 1, which has moved to the state "1", makes a transition to the state "2" because the matching with " " is validated in the state "1". It is thus apparent that the matching operation is performed twice for one character in this case.
Similarly, upon input of " ", " " and " ", the token 1 makes transitions successively to the states "3", "5" and "6" in this order. In the meantime, tokens 3 to 5 are generated as well, but they disappear because their matching is not validated.
In this manner, similar processing is performed for " " and " " inputted in succession.
In the course of matching operation described above, the matching takes place fourteen times in response to the input of the text of seven characters.
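The way in which the number of table references grows with the number of tokens can be sketched as follows. Since the automaton of FIG. 3 is not reproduced here, the sketch uses a hypothetical search term "abc" and a hypothetical text "ababc"; the function names are likewise assumptions introduced for illustration only.

```python
def build_transitions(terms):
    """Build a hypothetical transition table keyed by (state, character)."""
    transitions, term_ids = {}, {}
    next_free = 1
    for term_id, term in enumerate(terms, start=1):
        state = 0
        for ch in term:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_free
                next_free += 1
            state = transitions[key]
        term_ids[state] = term_id  # state in which the term terminates
    return transitions, term_ids


def match_with_tokens(text, transitions, term_ids):
    """Token-based matching: one table reference per token per character."""
    live, lookups, matches = [], 0, []
    for position, ch in enumerate(text):
        tokens = [0] + live  # a new token is generated in state 0 each time
        live = []
        for state in tokens:
            succeeding = transitions.get((state, ch), 0)
            lookups += 1
            if succeeding == 0:
                continue  # matching invalidated: the token disappears
            live.append(succeeding)
            if succeeding in term_ids:
                matches.append((position, term_ids[succeeding]))
    return matches, lookups


transitions, term_ids = build_transitions(["abc"])
matches, lookups = match_with_tokens("ababc", transitions, term_ids)
print(matches, lookups)
```

For this hypothetical input, one reference is made for the first character and two for each of the remaining four, giving nine references for five characters.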
A hitherto known character string matching means 102 designed for carrying out the character string matching processing described above is shown in FIG. 5. This character string matching means 102 is comprised of registers 211, 250 and 251, a state transition table 220, a matching result table 260, a selector 261, a gate 262, a multiplexer 263, buffers 280 and 281 and a comparator 252.
Now, matching operation of this character string matching means 102 will briefly be described.
An input text 204 is stored in the register 211 one character at a time. A character code outputted from the register 211 is inputted to the state transition table 220 as address information. The state transition table 220 is referenced by using the current state number 306 and the character code 302 as the address, whereby the ID number 303 of the transition-destined state to which the transition is to be made next (hereinafter referred to as the succeeding state number) is outputted from the table.
In this character string matching means, the succeeding state number 303 is used as a token identifier. The succeeding state number 303 serving as the token identifier is stored in the buffer 280 or 281 via the gate 262 and the multiplexer 263, as the information indicating the position at which the token is present. When the succeeding state number outputted from the state transition table 220 is "0" (zero), i.e., when it is the initial state number, this means that there is no destination to which the token can be moved. Consequently, when the succeeding state number 303 is the initial state number "0", it is necessary to extinguish the token. The control to this end is performed through cooperation of the comparator 252 and the gate 262.
After having been stored in the register 250, the succeeding state number 303 is stored selectively in one of the buffers 280 and 281 via the gate 262 and the multiplexer 263. At that time, the token can be extinguished by controlling the gate 262. In this conjunction, decision as to whether or not the token is to be extinguished is made by the comparator 252.
More specifically, when the succeeding state number 303 is the initial state number "0" (zero), comparison with the state number "0" (initial state number) stored in the register 251 as performed by the comparator 252 results in equality. As a result of this, the gate 262 is closed, whereby the succeeding state number 303 is extinguished without being sent to the multiplexer 263. In contrast, unless the succeeding state number 303 is the initial state number "0" (zero), the succeeding state number 303 is sent out to the multiplexer 263 through the gate 262 to be thereby retained as the token.
The initial state number is stored as the initial value at the start address of each of the buffers 280 and 281, so that the succeeding state number 303 sent through the multiplexer 263 is stored at the address succeeding that of the initial state number. In this manner, a token is made available in the initial state without fail.
The succeeding state number 303 is stored in one of the buffers 280 and 281, to be read out therefrom as a current state number 305 for the matching of the succeeding character code.
In the selector 261, the buffer 280 or 281 in which the token, i.e., the succeeding state number 303, has been stored is selected, whereon the current state number 305 is sequentially read out from the buffer selected. Upon completion of the reading, a read end signal 307 is generated. There is established synchronism between the multiplexer 263 and the selector 261 such that the selector 261 selects the buffer 281 when the multiplexer 263 selects the buffer 280. On the other hand, when the buffer 281 is selected by the multiplexer 263, the buffer 280 is selected by the selector 261. In this manner, the token to be moved to the transition-destined state is stored as the succeeding state number 303 in the buffer which differs from the buffer in which the token in the transition-source state is stored (stored as the current state number).
Change-over between the buffers 280 and 281 is effected at the time point when the read operation from the buffer 280 or 281 selected by the selector 261 has been completed, i.e., at the timing at which the read end signal 307 is generated. Ordinarily, a character code is fetched from the text into the register 211 in synchronism with the register 250. The register 211 holds the character code until the read end signal 307 is generated, and the succeeding input waits until the tokens of the transition destinations, i.e., the current state numbers, have been read out completely from the buffer. In the matching result table 260, a predetermined search term number identifying the search term is stored in correspondence to the state in which the search term terminates (hereinafter referred to as the termination state), while "0" (zero) is stored for the other states. Thus, the matching result 205 is meaningful only when the search term number outputted from the matching result table 260 in correspondence to the state number is other than "0".
A series of operations described above are repeated for each of the characters constituting the input text to thereby realize the character string matching processing.
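The alternating use of the two buffers described above may be modelled roughly as below. This is a behavioural sketch under several assumptions, not a description of the actual circuit of FIG. 5: the class and attribute names are hypothetical, the transition table is a dictionary in which a missing entry yields the initial state number "0", and the matching result table is assumed to be referenced with the succeeding state number.

```python
class TokenMatcher:
    """Behavioural model of the two-buffer token scheme (illustrative only)."""

    def __init__(self, transitions, result_table):
        self.transitions = transitions    # stands in for state transition table 220
        self.result_table = result_table  # stands in for matching result table 260
        # Stand-ins for buffers 280 and 281; the initial state number "0"
        # sits at the start address of each buffer.
        self.buffers = [[0], [0]]
        self.read_side = 0                # buffer chosen by the selector

    def step(self, char):
        """Process one character and return the search term numbers found."""
        read_buf = self.buffers[self.read_side]
        write_buf = self.buffers[1 - self.read_side]  # multiplexer side
        del write_buf[1:]  # keep only the initial state number
        results = []
        for current in read_buf:  # read until the buffer is exhausted
            succeeding = self.transitions.get((current, char), 0)
            if succeeding == 0:
                continue  # comparator finds "0": the gate closes, token is dropped
            write_buf.append(succeeding)
            term_number = self.result_table[succeeding]
            if term_number != 0:  # "0" means no search term ends in this state
                results.append(term_number)
        self.read_side = 1 - self.read_side  # change-over at read end
        return results


# Hypothetical automaton for the term "abc": states 0 -> 1 -> 2 -> 3.
transitions = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}
result_table = [0, 0, 0, 1]  # search term number 1 terminates in state 3
matcher = TokenMatcher(transitions, result_table)
outputs = [matcher.step(ch) for ch in "ababc"]
print(outputs)
```

Writing into one buffer while reading from the other keeps the tokens of the transition-destined states separate from those of the transition-source states, which is the purpose of the synchronism between the multiplexer and the selector described above.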
As is appreciated from the foregoing, according to the known technique the state transition table is referenced once for the matching of each token. Accordingly, in the case of the exemplary operation illustrated in FIG. 4, the token matching is performed fourteen times for the input text composed of seven characters. In other words, the state transition table is referenced twice per character on average. Consequently, the matching throughput is lowered to about a half of that of processing which requires only a single matching per character, which is a problem.
In both of the known techniques described above, the state transition table must be referenced a number of times in the matching processing for a single character of the text. As a consequence, the matching processing cycle becomes several times as long as the cycle time of the memory used as the state transition table. Thus, in order to implement high-speed character string matching means operating on the order of several tens of megabytes per second, a matching cycle on the order of several tens of nanoseconds is required. This means that an inexpensive memory such as DRAM can no longer be used, and a high-speed memory such as SRAM must be employed instead. As an ultimate result, high costs are involved in the implementation of the text search system, which is a serious problem.
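The relation between the target throughput and the memory cycle can be checked with a short calculation. The figures used below (20 megabytes per second, one or two bytes per character, two table references per character) are illustrative assumptions rather than values taken from the known systems, and one megabyte is taken as 10^6 bytes.

```python
def required_memory_cycle_ns(throughput_mb_per_s, bytes_per_char, lookups_per_char):
    """Memory cycle (ns) needed to sustain a given matching throughput.

    Assumes 1 MB = 10**6 bytes, so one byte at 1 MB/s takes 1000 ns.
    """
    ns_per_char = bytes_per_char * 1000.0 / throughput_mb_per_s
    return ns_per_char / lookups_per_char


# One-byte characters at 20 MB/s leave 50 ns per character; with two
# table references per character the memory must cycle in half that time.
print(required_memory_cycle_ns(20, 1, 1))
print(required_memory_cycle_ns(20, 1, 2))
print(required_memory_cycle_ns(20, 2, 2))
```

Under these assumptions, each additional table reference per character shortens the permissible memory cycle proportionally, which is why the number of references per character bears directly on the choice of memory.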