(a) Field of the Invention
The present invention relates to a method and system for retrieving a data pattern and, more particularly, to a method and system for use in extracting a character pattern in a document file including a relatively long character string.
(b) Description of the Related Art
A deterministic finite automaton technique is generally used for extracting a specific character string, or character pattern, included in a document file having a relatively long character string. In this technique, a state transition table is first prepared from the specific character pattern to be extracted from a subject document file and is stored in a content addressable memory (CAM). The subject document file is then examined based on the state transition table to thereby extract the specific character pattern from the subject document file.
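As an illustration of how such a state transition table might be prepared, the following Python sketch derives one from a set of patterns. The construction used here (one state per proper prefix of a pattern, with each transition leading to the longest suffix that is still a proper prefix) is a common textbook approach and is an assumption; the source does not state how the table is generated. The names `build_table` and `ALPHABET` are hypothetical.

```python
def build_table(patterns, alphabet):
    """Return a list of (state, char, next_state, pattern_no) entries.

    patterns maps a pattern number to its character string.
    """
    # One state per proper prefix of any pattern; "" becomes initial state 0.
    prefixes = sorted({p[:i] for p in patterns.values() for i in range(len(p))})
    state_of = {prefix: n for n, prefix in enumerate(prefixes)}

    entries = []
    for prefix in prefixes:
        for ch in alphabet:
            text = prefix + ch
            # Pattern number reported when this transition completes a pattern.
            found = next((n for n, p in patterns.items() if text.endswith(p)), 0)
            # Next state: longest suffix of text that is still a proper prefix.
            suffix = next(text[i:] for i in range(len(text) + 1)
                          if text[i:] in state_of)
            nxt = state_of[suffix]
            # Entries that return to state 0 without a match are left out;
            # a lookup miss is assumed to mean "reset to state 0".
            if nxt != 0 or found != 0:
                entries.append((state_of[prefix], ch, nxt, found))
    return entries


ALPHABET = "ABCD"  # assumed alphabet, for illustration only
table = build_table({1: "ABC", 2: "BCD"}, ALPHABET)
for address, entry in enumerate(table, start=1):
    print(address, entry)
```

Numbering the entries from 1, entry 3 is `(1, 'A', 1, 0)` and entry 7 is `(2, 'C', 4, 1)`, which agrees with the two lookups the description quotes from FIG. 11. The other entries are not confirmed by the source.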
FIG. 10 shows a data retrieval system described in Patent Publication JP-A-62(1987)-179083. The system, generally designated by numeral 50, includes a register 76 for consecutively latching and storing therein a character in an input character string 500, a CAM 70 for retrieving characters in the input character string 500 one by one through the register 76, an associated data RAM 73 for operating in association with the CAM 70, a state number register 77 for storing a next state number 523 for the CAM 70, and a pattern number register 78 for receiving a pattern number of the retrieved character string to output the same.
The CAM 70 includes a current-state-number storage section 71 and a matching character storage section 72. The current-state-number storage section 71 stores therein a current state number before the current transition, whereas the matching character storage section 72 stores therein a matching character which is to be matched with the input character and constitutes a condition for the state transition.
The associated data RAM 73 includes a next-state-number storage section 74 and a pattern number storage section 75. The next-state-number storage section 74 stores therein a next state number specified by the current state transition, whereas the pattern number storage section 75 stores therein a pattern number of the character pattern retrieved by the current state transition.
The state number register 77 stores therein a next state number 523 received from the next-state-number storage section 74 to provide the same to the CAM 70 as a current state number 520 for the next retrieval. The pattern number register 78 receives the pattern number stored in the pattern number storage section 75 to output the pattern number 501 of the retrieved character pattern as the output of the data retrieval system 50.
FIG. 11 shows an example of the contents of the CAM 70 and the associated data RAM 73. In this example, the system 50 retrieves a pattern-1, “ABC”, having a pattern number “1” and a pattern-2, “BCD”, having a pattern number “2”, wherein each character pattern has a specific character string including a plurality of specific letters. In the CAM 70, each current state number stored in the current-state-number storage section 71 and each matching character stored in the matching character storage section 72 form an input entry having a specific address. The associated data RAM 73 has a plurality of output entries each having a next state number in the next-state-number storage section 74 and a pattern number in the pattern number storage section 75. Each output entry of the associated data RAM 73 is listed in association with a corresponding input entry of the CAM 70 in the state transition table.
FIG. 12 is a state transition diagram representing the contents of the CAM 70 and the associated data RAM 73. In this diagram, each number in a circle corresponds to a state number, wherein the state number located at the base of an arrow corresponds to the current state number before the current state transition represented by the arrow, and the state number located at the tip of the arrow corresponds to the next state number after the current state transition. The letter attached to the arrow corresponds to the character retrieved from the input character string during the current state transition.
Although Patent Publication JP-A-62(1987)-179083 further describes a signal representing success or failure in the retrieval by the CAM 70, as well as a reset circuit for resetting the state number register 77 if the CAM 70 fails to retrieve a character pattern, descriptions of these items are omitted here because they are irrelevant to the problem of the conventional technique to be solved by the present invention.
In operation of the data retrieval system 50 shown in FIG. 10, the characters in the input character string 500 of a document file are latched in the register 76 one by one. The CAM 70 receives a current state number 520 from the state number register 77 and a character 521 in the input character string 500 of the document file from the register 76, as a retrieval key for the current retrieval. The CAM 70 performs a retrieval based on the state transition table and the retrieval key, which includes the current state number 520 received from the state number register 77 and the character 521 received from the register 76, thereby delivering an address 522 of the input entry matched with the retrieval key.
The associated data RAM 73 receives the address 522 from the CAM 70 and outputs the contents of the output entry stored therein and specified by the address 522, the contents including a next state number 523 and a pattern number 524 of the character pattern retrieved by the CAM 70.
If the CAM 70 receives, for example, a state number “1” and a character “A” as a retrieval key, then the CAM 70 outputs an address “3” based on the state transition table shown in FIG. 11. In this case, the associated data RAM 73 receives the address “3” from the CAM 70 to thereby output a next state number “1” and a pattern number “0”, which indicates that no pattern is found in this retrieval, as shown in FIG. 11. This is also shown in FIG. 12 by a loop arrow labeled “A” and having both its base and its tip on state number “1”.
In another case, if the CAM 70 receives a state number “2” and a character “C” as a retrieval key, then the CAM 70 outputs an address “7” to the associated data RAM 73, which outputs a next state number “4” and a pattern number “1”, which means that pattern-1 is found in the current retrieval. This is also shown in FIG. 12 by an arrow labeled “C” and having a base on state number “2” and a tip on state number “4”.
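The two lookups above can be traced with a small Python model. It is a sketch only: the CAM is modeled as a mapping from a retrieval key (current state number, character) to an entry address, and the associated data RAM as a mapping from that address to a (next state number, pattern number) pair. Only the two entries quoted in the text are filled in, and the names `cam`, `ram`, and `lookup` are hypothetical.

```python
# CAM 70: retrieval key (current state number, matching character) -> address.
cam = {(1, "A"): 3, (2, "C"): 7}
# Associated data RAM 73: address -> (next state number, pattern number).
ram = {3: (1, 0), 7: (4, 1)}


def lookup(state, char):
    """Return (address, next_state, pattern_no), or None on a CAM miss."""
    address = cam.get((state, char))
    if address is None:
        return None
    next_state, pattern_no = ram[address]
    return address, next_state, pattern_no


print(lookup(1, "A"))  # (3, 1, 0): stay in state 1, no pattern found
print(lookup(2, "C"))  # (7, 4, 1): move to state 4, pattern-1 found
```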
The next state number 523 is latched by the state number register 77 and used as the current state number 520 for the next retrieval in the CAM 70. The pattern number 524 is latched by the pattern number register 78, which outputs the same as the pattern number 501 of the retrieved character pattern. The data retrieval system 50 iterates the retrieval operation for all the characters in the input character string 500 of the document file, thereby retrieving patterns including pattern-1, “ABC”, and pattern-2, “BCD”.
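The iteration can be sketched in Python as follows. The transition table is an assumed reconstruction, chosen to agree with the two entries the text quotes from FIG. 11, namely the keys (1, “A”) and (2, “C”). The remaining entries, the initial state 0, the reset to state 0 on a miss, and the names `TABLE` and `scan` are assumptions, not contents of the figure.

```python
# (current state, character) -> (next state, pattern number); 0 = no pattern.
# Assumed table for pattern-1 "ABC" and pattern-2 "BCD".
TABLE = {
    (0, "A"): (1, 0), (0, "B"): (3, 0),
    (1, "A"): (1, 0), (1, "B"): (2, 0),
    (2, "A"): (1, 0), (2, "B"): (3, 0), (2, "C"): (4, 1),
    (3, "A"): (1, 0), (3, "B"): (3, 0), (3, "C"): (4, 0),
    (4, "A"): (1, 0), (4, "B"): (3, 0), (4, "D"): (0, 2),
}


def scan(text):
    """Return (end position, pattern number) for each pattern found."""
    state = 0  # state number register, assumed to start at state 0
    hits = []
    for pos, char in enumerate(text):
        # One retrieval: the key is the latched state plus the current
        # character. A miss is assumed to reset the state register to 0.
        state, pattern_no = TABLE.get((state, char), (0, 0))
        if pattern_no:
            hits.append((pos, pattern_no))
    return hits


print(scan("XABCDAB"))  # [(3, 1), (4, 2)]
```

Each pass through the loop needs the `state` value produced by the previous pass, so the retrievals in this model cannot overlap.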
In the conventional data retrieval system, there is a problem in that the speed of retrieval is extremely low, because the CAM 70 cannot start the next retrieval until the next state number, which is the result of the current retrieval, has been output from the associated data RAM 73.