In recent years, the amount of information handled in various fields has been increasing. In some fields, it is reaching the order of gigabytes to terabytes, and it is becoming difficult to find desired data within such an enormous amount of data in a short period of time.
Pattern matching, which searches for arbitrary text patterns, is used in a variety of fields, such as word-processing software and database searches.
Conventionally, various techniques have been proposed for searches using pattern matching. Japanese Laid-open Patent Publication No. 2005-70911 describes a search device and a search method using a deterministic finite automaton.
FIGS. 1A and 1B are diagrams explaining the Sigma algorithm, which is one of the keyword search algorithms. FIG. 1A illustrates an example in which keywords, such as “blue”, “green”, “red” and “yellow”, are searched for in a document to be searched. As illustrated schematically, an automaton corresponding to the keyword conditions is created. Specifically, an automaton is created in which a transition is made from the root denoted by 0 to the head characters “b”, “g”, “r” and “y” of the respective keywords, and further transitions are made to the subsequent characters of the respective keyword strings. If every character up to the last one of a keyword is matched, that keyword string has been found. The keyword string “green” includes “re”, and if “d” follows, the input matches “red”; therefore, as illustrated schematically, there is also a path in which a transition is made at “d” from partway through “green” to “red”.
Search starts from the root. While the input follows a keyword within the automaton, transitions are made sequentially, and if a character that has nothing to do with the keywords is read, the search returns to the root. For example, when a document that includes “black”, as in FIG. 1B, is input, a transition is made to the head character “b” of “blue” and then to the next character “l”; however, the next character is “a”, and therefore the search returns to the root. If the document includes a keyword string of the automaton, a transition is made to the last character of the keyword string and HIT information is output, and thus the existence of the keyword string is found.
In the automaton search, the place of a character in each keyword string is denoted by a node or “state”. For example, the root is denoted by state 1, “b” in “blue” is denoted by state 2 and “u” is denoted by state 4.
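The keyword automaton described above can be sketched in software as follows. This is a minimal illustration, not the device of the cited publication: the fallback transitions (such as the path from “green” to “red”) are computed here with failure links in the Aho-Corasick style, which is an assumption, and the names `build_automaton` and `search` are hypothetical. States are numbered in creation order with the root as 0, so the numbers do not necessarily match those in the figures.

```python
from collections import deque

def build_automaton(keywords):
    # goto[s] maps a character to the next state; state 0 is the root.
    goto = [{}]
    hits = [set()]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                hits.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        hits[state].add(word)          # last character of the keyword
    # fail[s]: state for the longest proper suffix of s's path that is
    # also a prefix of some keyword (assumed construction).
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            hits[nxt] |= hits[fail[nxt]]
            queue.append(nxt)
    return goto, fail, hits

def search(text, automaton):
    """Return (start offset, keyword) pairs for every keyword found in text."""
    goto, fail, hits = automaton
    state = 0
    found = []
    for pos, ch in enumerate(text):
        # Fall back until the character can be consumed or the root is reached.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in sorted(hits[state]):
            found.append((pos - len(word) + 1, word))
    return found

automaton = build_automaton(["blue", "green", "red", "yellow"])
print(search("black", automaton))   # no keyword: the search falls back to the root at "a"
print(search("gred", automaton))    # "gre" followed by "d" reaches "red"
```

In this sketch the input “gred” follows “g”, “r”, “e” along “green” and then reaches the last state of “red” at “d”, which corresponds to the cross path described for FIG. 1A.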
FIG. 2 is a diagram illustrating a schematic configuration of a search device that uses the deterministic finite automaton. A document to be searched 11 stored in a disc device, etc., is input and a character byte code 12 is taken out sequentially therefrom. In the following explanation, it is assumed that the character byte code is 8 bits, so that 256 entries are formed for each state. Further, an index that is a combination of a current state (data that indicates the state) 13 held in a register and the character byte code 12 is used as an input address 14 of a memory 15 in which a search keyword automaton is formed. The memory 15 to which the input address 14 is input outputs output data 16 that includes a next state 17 and HIT information. The current state 13 is then replaced with the next state 17.
FIGS. 3A and 3B are diagrams explaining the operation of the search device in FIG. 2. For the sake of simplification of explanation, an example is illustrated, in which a keyword string of “fo” is searched for. As illustrated in FIG. 3A, transition is made from state 1 to state 2 if “f” (0x66) appears, and transition is made to state 3 if “o” (0x6F) appears in state 2, and if “f” (0x66) appears again, state 2 is maintained and in other cases, the search returns to state 1.
The search keyword automaton for this example is illustrated in FIG. 3B. The character byte code is 8 bits and each state has 256 entries (addresses). Memory part M-X on the left-hand side indicates a transition destination from state 1, memory part M-Y in the center indicates a transition destination from state 2, and memory part M-Z on the right-hand side indicates a transition destination from state 3. The memory part M-X stores a transition destination for an address that combines a 0xXXXX part indicative of state 1, an XX part indicative of a plurality of keyword strings, and a 00-ff part indicative of the entry in state 1 of each keyword string. With 66 of 00-ff indicative of an entry, a transition is made to state 2, and in other cases, state 1 is maintained. Therefore, the address 0xYYYY_YY00 of state 2 is stored in 0xXXXX_XX66, and the address 0xXXXX_XX00 of state 1 is stored for the other addresses. The hit information HIT in the memory part M-X is N/A, indicative of being not hit. Entries for the other keyword strings are stored at addresses in which the XX part, i.e., the third and fourth lowest digits, is changed.
Similarly, the memory part M-Y stores a transition destination for an address that combines address part 0xYYYY indicative of state 2, YY indicative of a plurality of keyword strings, and 00-ff indicative of the entry in state 2 of each keyword string. With 6f of 00-ff indicative of an entry, transition is made to state 3 and with 66, state 2 is maintained, and in other cases, the search returns to state 1, and therefore, in 0xYYYY_YY6f, the address 0xZZZZ_ZZ00 of state 3 is stored and the address 0xYYYY_YY00 of state 2 is stored in 0xYYYY_YY66, and the address 0xXXXX_XX00 of state 1 is stored for the other addresses. The hit information HIT in the memory part M-Y is N/A indicative of being not hit.
Similarly, the memory part M-Z stores a transition destination for an address that combines address part 0xZZZZ indicative of state 3, ZZ indicative of a plurality of keyword strings, and 00-ff indicative of the entry in state 3 of each keyword string. With 66 of 00-ff indicative of an entry, a transition is made to state 2, and the search returns to state 1 in the other cases. Therefore, the address 0xYYYY_YY00 of state 2 is stored in 0xZZZZ_ZZ66, and the address 0xXXXX_XX00 of state 1 is stored for the other addresses. The hit information HIT in the memory part M-Z is HIT, indicative of being hit. Consequently, if the search reaches the memory part M-Z in state 3, the keyword exists.
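The memory contents described for M-X, M-Y and M-Z can be summarized in a short sketch. The head addresses below are hypothetical stand-ins for 0xXXXX_XX00, 0xYYYY_YY00 and 0xZZZZ_ZZ00, and `read_memory` is an illustrative name. Only the non-default entries are listed; a real memory would hold all 256 entries per state.

```python
# Hypothetical head addresses of the tables for state 1, state 2 and state 3.
STATE1, STATE2, STATE3 = 0x0000_0100, 0x0000_0200, 0x0000_0300

# Entries that differ from the default; every other entry of a state's
# 256-entry table stores the head address of state 1.
EXCEPTIONS = {
    STATE1 | 0x66: STATE2,  # "f" in state 1 -> state 2
    STATE2 | 0x6F: STATE3,  # "o" in state 2 -> state 3
    STATE2 | 0x66: STATE2,  # "f" in state 2 -> state 2 is maintained
    STATE3 | 0x66: STATE2,  # "f" in state 3 -> state 2
}
HIT_STATES = {STATE3}       # every entry of M-Z carries HIT

def read_memory(state_head, code):
    """Return (head address of the next state, HIT flag) for one 8-bit code."""
    address = state_head | code          # the low 8 bits select the entry
    return EXCEPTIONS.get(address, STATE1), state_head in HIT_STATES

print(read_memory(STATE1, ord("f")) == (STATE2, False))
print(read_memory(STATE2, ord("o")) == (STATE3, False))
print(read_memory(STATE3, ord("x")) == (STATE1, True))
```

Combining the state's head address with the character code by a bitwise OR works here because each head address has zeros in its low 8 bits, matching the 00-ff entry part of the addresses described above.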
FIG. 4 is a flowchart illustrating the processing of a search device that makes use of the above-mentioned automata and search processing using the same.
In step 101, a search condition is input. In the above-mentioned example, “fo” is input. In the example in FIG. 1A, “blue”, “green”, “red” and “yellow” are input.
In step 102, an automaton is created based on the input search condition, i.e., the keyword string.
In step 103, the automaton is constructed in a memory.
In step 104, a search is performed using the automaton in the memory with a document to be searched as an input.
In step 105, a search result is output.
FIG. 5 is a flowchart illustrating the construction processing of the automaton in the memory in step 103 in more detail, for example, processing to construct a table such as that illustrated in FIG. 3B in the memory.
In step 111, a table is created for each node (state) of the automaton.
In step 112, in order to represent an edge (transition destination) of the automaton, the head address of the table of the transition destination node is written in an entry of the table.
In step 113, a hit flag is written to the table in the last state.
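Steps 111 to 113 can be sketched as below. This is an illustration under stated assumptions: the function name `build_memory`, the representation of the automaton as an `edges` dictionary, the numbering of states from 0 (root), and the choice of `state * 256` as each table's head address are all hypothetical. Characters without an explicit edge return to the root, as in the “fo” example.

```python
ENTRIES = 256  # one entry per 8-bit character code

def build_memory(num_states, edges, final_states):
    """Lay out an automaton as one flat table of [next head address, hit flag].

    edges maps (state, character code) to a destination state.
    """
    # Step 111: create a 256-entry table for each state.
    head = [state * ENTRIES for state in range(num_states)]
    memory = [[head[0], False] for _ in range(num_states * ENTRIES)]
    # Step 112: write the head address of the destination table into the entry.
    for (state, code), dest in edges.items():
        memory[head[state] + code][0] = head[dest]
    # Step 113: write the hit flag to the table of each last state.
    for state in final_states:
        for code in range(ENTRIES):
            memory[head[state] + code][1] = True
    return memory

# The "fo" automaton, with states 1, 2, 3 of the text renumbered as 0, 1, 2.
fo_edges = {(0, 0x66): 1, (1, 0x6F): 2, (1, 0x66): 1, (2, 0x66): 1}
fo_memory = build_memory(3, fo_edges, {2})
print(len(fo_memory))             # 3 states x 256 entries
print(fo_memory[256 + 0x6F])      # entry for "o" in the second table
```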
FIG. 6 is a flowchart illustrating search processing using the automaton constructed in the memory in step 104 in more detail.
In step 121, the next character is read from the document to be searched, and in step 122, whether the end of the document has been reached is determined. When it is the end of the document, the processing proceeds to step 105, and when not, the processing proceeds to step 123.
In step 123, data is read from the memory on which the automaton is constructed, with the current state and the character code of the read character as an input address.
In step 124, whether the keyword is hit in the data that is read is determined, i.e., whether a state that includes HIT information is reached is determined, and when not hit, the processing proceeds to step 126 and when hit, the processing proceeds to step 125.
In step 125, HIT information is updated.
In step 126, the “current state” is updated to the “next state” included in the data read in step 123 and the processing returns to step 121.
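The loop of steps 121 to 126 can be sketched as follows for the keyword “fo”. The names `fo_memory` and `search` are hypothetical, and states are numbered 0, 1, 2 instead of 1, 2, 3. One deliberate simplification is assumed: the HIT flag is stored in the entry that leads into the last state, so that a hit is reported as soon as the final character is read, whereas the table described for FIG. 3B places the flag in the table of the last state itself.

```python
def fo_memory():
    """Flat table for "fo": entry = (next state, HIT flag), address = state << 8 | code."""
    memory = [(0, False)] * (3 * 256)          # default: return to the root
    for state in range(3):
        memory[(state << 8) | ord("f")] = (1, False)   # "f" always leads to state 1
    memory[(1 << 8) | ord("o")] = (2, True)    # "o" after "f": keyword found (assumed flag placement)
    return memory

def search(document: bytes, memory, root=0):
    """Return the offsets at which a memory read reported HIT."""
    current_state = root
    hit_offsets = []
    for offset, code in enumerate(document):   # steps 121-122: read until the end
        next_state, hit = memory[(current_state << 8) | code]   # step 123
        if hit:                                # step 124
            hit_offsets.append(offset)         # step 125
        current_state = next_state             # step 126
    return hit_offsets                         # passed on to step 105

memory = fo_memory()
print(search(b"foo fo", memory))
print(search(b"black", memory))
```

Each character costs exactly one memory read regardless of how many keywords the table encodes, which is why the size of the table, rather than the number of keywords, governs the search speed.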
The construction of the search keyword automaton in the memory and the keyword search that makes use of the automaton are as explained above. Attempts have also been made to speed up the search by preparing a plurality of memories on which the same search keyword automaton is constructed, dividing a document to be searched into a plurality of parts, supplying the parts to the respective memories, and performing the searches in parallel. Loading the search keyword automaton into a cache memory is also intended to speed up the search.
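Dividing the document and searching the parts in parallel might look like the sketch below. Several points are assumptions rather than part of the description above: `bytes.find` stands in for one automaton search unit, a thread pool stands in for the plurality of memories, and each part is extended by one less than the keyword length so that a keyword straddling a boundary between parts is not missed. The names `find_all` and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def find_all(part: bytes, keyword: bytes, base: int):
    """Stand-in for one search unit: document offsets of keyword within part."""
    offsets = []
    start = part.find(keyword)
    while start != -1:
        offsets.append(base + start)
        start = part.find(keyword, start + 1)
    return offsets

def parallel_search(document: bytes, keyword: bytes, parts: int = 4):
    size = max(1, -(-len(document) // parts))   # ceiling division
    overlap = len(keyword) - 1                  # covers matches across a boundary
    jobs = []
    with ThreadPoolExecutor(max_workers=parts) as pool:
        for base in range(0, len(document), size):
            part = document[base:base + size + overlap]
            jobs.append(pool.submit(find_all, part, keyword, base))
        # Collect the per-part results into one sorted list of offsets.
        return sorted({off for job in jobs for off in job.result()})

print(parallel_search(b"xxfoxxxfo", b"fo"))
```

In this example the first “fo” lies across the boundary between the first and second parts and is still found because of the overlap.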
The search device and the search method that make use of the conventional automaton described above have a problem in that the size of the automaton becomes very large when the number of search expressions is large or when the search expression is complicated. When the size of the automaton increases, the amount of memory used increases. As a result, cache misses occur frequently during processing by a processor, etc., which reduces the search speed, or the cost of hardware increases.
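The growth in memory can be estimated with simple arithmetic: each state needs 256 entries, so the table size is the number of states multiplied by 256 and by the size of one entry. The entry size of 4 bytes and the state counts below are assumptions chosen only for illustration.

```python
def table_bytes(num_states, entries_per_state=256, bytes_per_entry=4):
    """Memory needed for a table with one entry per state and character code."""
    return num_states * entries_per_state * bytes_per_entry

print(table_bytes(3))          # the three-state "fo" example
print(table_bytes(100_000))    # a hypothetical automaton with 100,000 states
```

Under these assumptions the three-state example needs 3,072 bytes, while an automaton with 100,000 states needs about 102 megabytes.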
Japanese Laid-open Patent Publication No. 2005-242668 describes a search device and a search method that reduce the memory size of the automaton and improve throughput by applying a hash function to text.
However, because the method described in Japanese Laid-open Patent Publication No. 2005-242668 uses a hash function, a text character string that does not match may be detected as a match, and there is a further problem that the memory size of the automaton may not be reduced sufficiently.