Pattern matching generally refers to searching for predefined keywords in text data and is a basic topic in computer science. Research on pattern matching is important in many fields, such as information retrieval and pattern recognition, and is significant in applications such as spell checking, language translation, data compression, search engines, intrusion detection, content filtering, computer virus signature matching, and gene sequence comparison. For example, in applications such as information retrieval and text editing, a user specifies some keywords, and the locations of the keywords need to be found in the text quickly.
The prior art provides the Aho-Corasick (AC) algorithm. The algorithm is simple and efficient, and can find all locations of a finite number of keywords in any text. The principle of the algorithm is to define a finite state machine according to a set of keywords, and then use the text as the input of the finite state machine. Whenever a matched keyword is found, the algorithm reports a successful match for that keyword. Depending on the number of bytes input at a time, the AC algorithm is categorized into the original AC algorithm (1 byte is input at a time) and the multi-byte AC algorithm.
FIG. 1 shows the functions constructed through the original AC algorithm in the prior art. Suppose the user specifies the keywords {he, she, his, hers} and expects to find any of these keywords in the text; the algorithm outputs the search result and notifies the user of it. Pattern matching through the original AC algorithm begins by generating a state transition (goto) function (as shown in FIG. 1A), a failure function (as shown in FIG. 1B), and an output function (as shown in FIG. 1C) according to the keyword set {he, she, his, hers}. This process includes two steps: The first step is to determine the states and the goto function (represented by g), and the second step is to calculate the failure function (represented by f). The construction of the output function begins in the first step and ends in the second step. Suppose the text “sshe” is input for searching. When the first “s” is input, g(0, s) is equal to 3, and therefore the state is transitioned to 3. When the second “s” is input, the goto function fails, that is, g(3, s) is equal to “fail”, and therefore the failure function is invoked; f(3)=0 indicates a transition to state 0. Afterward, g(0, s) is equal to 3, and therefore the current state is again state 3. When “h” is input, g(3, h) is equal to 4, and the state is transitioned to 4. When “e” is input, g(4, e) is equal to 5, and the state is transitioned to 5. Because Output(5)={she, he}, this search process finds two keywords predefined by the user: {she, he}.
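The two construction steps and the “sshe” walk-through can be illustrated with a minimal Python sketch. It is not taken from the prior art document: the names build_ac, g, and search are hypothetical, and the sketch assumes that keywords are inserted in the order {he, she, his, hers}, which happens to yield the state numbers used in the description above.

```python
from collections import deque

FAIL = None  # marker for an undefined goto transition (hypothetical choice)


def build_ac(keywords):
    """Build the goto, failure, and output functions for a keyword set."""
    goto = [{}]        # goto[state][char] -> next state
    output = [set()]   # output[state] -> keywords recognized in that state

    # Step 1: determine the states and the goto function; start the outputs.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    # Step 2: compute the failure function breadth-first; finish the outputs.
    failure = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to state 0
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            while state != 0 and ch not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(ch, 0)
            output[s] |= output[failure[s]]
    return goto, failure, output


def g(goto, state, ch):
    """Goto function; state 0 loops to itself on characters with no edge."""
    if ch in goto[state]:
        return goto[state][ch]
    return 0 if state == 0 else FAIL


def search(text, goto, failure, output):
    """Return (end_index, keyword) pairs for every match found in text."""
    state, matches = 0, []
    for i, ch in enumerate(text):
        while g(goto, state, ch) is FAIL:
            state = failure[state]  # extra lookup caused by a goto failure
        state = g(goto, state, ch)
        matches.extend((i, word) for word in sorted(output[state]))
    return matches


goto, failure, output = build_ac(["he", "she", "his", "hers"])
print(search("sshe", goto, failure, output))
```

In this sketch, the inner while loop in search is where the failure function is consulted, which corresponds to the transition from state 3 to state 0 on the second “s”.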
In the original AC algorithm, 1 byte is input at a time. To improve the efficiency of the algorithm, the prior art provides an improved AC algorithm (namely, the multi-byte AC algorithm). The basic concept of the improved algorithm is as follows: n bytes are input and detected at a time. Because a defined n-byte characteristic string does not necessarily begin at an integral multiple of n from the start position of the data to be detected, its offset may be any value from 0 to n−1 characters. To avoid missed detections, it is necessary to detect all offset positions; that is, n state machines may be used concurrently to perform detection.
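The offset problem can be made concrete with a short Python sketch. The helper names blocks and find_block and the sample data are hypothetical and serve only as an illustration: the same data is cut into n-byte blocks at each of the n possible offsets, and a given n-byte string appears as a whole block under only one of them.

```python
def blocks(data, n, offset):
    """Split data into consecutive n-byte blocks, starting at offset."""
    return [data[i:i + n] for i in range(offset, len(data) - n + 1, n)]


def find_block(data, pattern):
    """Return every offset in 0..n-1 whose block stream contains pattern."""
    n = len(pattern)
    return [off for off in range(n) if pattern in blocks(data, n, off)]


data = b"a telephone"  # hypothetical input data
for off in range(4):
    print(off, blocks(data, 4, off))
print(find_block(data, b"phon"))
```

Here “phon” starts at byte 6 of the data, so it is aligned only at offset 2 when n is 4; a single state machine that starts at offset 0 would never receive it as one input block.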
FIG. 2 shows the functions constructed through a multi-byte AC algorithm in the prior art. Suppose the keyword set specified by the user is {S1: technical; S2: technically; S3: tel.; S4: telephone; S5: phone; S6: elephant}, and the detection step length n is 4 bytes. FIG. 2A shows the constructed goto function. Starting from state q0, the state is transitioned to q1 after “tech” is input, to q2 after “tele” is input, to q3 after “phon” is input, to q4 after “elep” is input, and so on. Because the length of a matched pattern is not necessarily an integral multiple of 4 bytes, short fields need to be added. Short fields are added in the failure function shown in FIG. 2B.
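The need for short fields can be seen by cutting each keyword into 4-byte segments. The following Python sketch is only an illustration: the helper split_keyword is hypothetical, and it assumes that a short field is the trailing remainder of a keyword that is shorter than n bytes. It does not reproduce the state numbering of FIG. 2A.

```python
def split_keyword(keyword, n):
    """Split a keyword into full n-byte segments plus a trailing short field."""
    full_len = len(keyword) - len(keyword) % n
    segments = [keyword[i:i + n] for i in range(0, full_len, n)]
    short_field = keyword[full_len:]  # empty if the length is a multiple of n
    return segments, short_field


keywords = ["technical", "technically", "tel.", "telephone", "phone", "elephant"]
for kw in keywords:
    segments, short_field = split_keyword(kw, 4)
    print(f"{kw}: segments={segments} short_field={short_field!r}")
```

Under this assumption, only “tel.” and “elephant” consist entirely of full 4-byte segments; the other four keywords end in a remainder of 1 to 3 bytes.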
With the multi-byte AC algorithm, the step length of each transition is n bytes. In the original AC algorithm, a transition over n bytes involves n memory accesses; in the improved AC algorithm, a transition over n bytes involves only one memory access, and the access speed is therefore n times that of the original AC algorithm (4 times in the foregoing example, where n is 4).
However, both the original AC algorithm and the multi-byte AC algorithm rely on a failure function. When the goto function fails, the AC pattern matching state machine needs to access the memory at least one more time to read the failure function, which reduces the efficiency of the AC algorithm.
Moreover, for the multi-byte AC algorithm, the index of the state transition table includes not only the input state but also the input string. Therefore, each entry needs to store the complete failure chain; otherwise, once the matching fails, the after-failure state is missing. FIG. 3 shows the state transition table of the multi-byte AC algorithm provided in FIG. 2. For the <q2, phon> entry in FIG. 3, if only the first failure state q3 is stored and the input string also fails in state q3, it is necessary to look up the failure output state under the input state q3. However, the failure output state of q3 is stored in the <q0, phon> entry, which is indexed by q0, the parent state of q3. Because the previous state q0 cannot be deduced from q3 alone, the table cannot be searched for the next failure state, and the matching cannot continue.
Moreover, because the failure chain has a variable length and cannot be stored in a single entry, another failure chain table needs to be created for storing the complete failure chain. Through a mechanism such as a pointer, the failure field in the original table points to an address in the newly created complete failure chain table. Such processing methods are complicated and cumbersome, and require extra storage space and extra processing steps.
To improve the efficiency of the AC algorithm, the prior art provides a method of eliminating the failure function for the original AC algorithm. In this method, a δ function is introduced in place of the goto function “g”; that is, a new goto function δ is obtained as a combination of all goto functions and failure functions. The δ function is created on the basis of the goto functions and failure functions. After the δ function is created, the pattern matching state machine of the AC algorithm is composed of a finite state set “S” and the next-hop goto function δ. For each input character “a” under state “s”, the δ(s,a) function has an output state “r” that belongs to the finite state set “S”; that is, a definite output state exists for every input character. In this way, no failure function actually exists. In pattern matching, it is only necessary to execute state←δ(state,ai). The process of generating the δ function includes the following steps:
1. Create a new empty state set “S”.
2. For the initial state 0 and every input character “a”, use the output state of the goto function as the output state of the δ function, namely, δ(0, a)=g(0, a). If the output state “r” is not 0, add the state “r” to the state set: S←S∪{r}.
3. Retrieve a state “s” from the state set S, and delete the state s from S. For every input character “a” under the state s, perform one of the following steps:
(1) If the output of the goto function g(s,a) is not “fail”, set the output state as the output of the δ function under this state, namely, δ(s,a)=g(s,a), and add the output state to the state set S; or
(2) If the output of the goto function g(s,a) is “fail”, use the output state f(s) of the failure function of the state s as a new input state, evaluate the δ function, and use its output state as the output state of the δ function, namely, δ(s,a)=δ(f(s),a).
4. Repeat the foregoing process until the state set “S” is empty.
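The foregoing steps can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the prior art: the function names are hypothetical, the alphabet is assumed to be the 256 single-byte characters, the state set S is assumed to be processed in first-in first-out order so that δ(f(s),a) is already known when state s is handled, and every state newly reached through the goto function is assumed to join S.

```python
from collections import deque


def build_goto_and_failure(keywords):
    """Build the goto function (a trie) and the failure function."""
    goto = [{}]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
    failure = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            while state != 0 and ch not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(ch, 0)
    return goto, failure


def build_delta(goto, failure, alphabet):
    """Merge goto and failure into one total transition function delta."""
    delta = [{} for _ in goto]
    pending = deque()                  # step 1: state set S, initially empty
    for a in alphabet:                 # step 2: delta(0, a) = g(0, a)
        r = goto[0].get(a, 0)          # g(0, a) never fails; it loops to 0
        delta[0][a] = r
        if r != 0:
            pending.append(r)
    while pending:                     # step 4: repeat until S is empty
        s = pending.popleft()          # step 3: take a state out of S
        for a in alphabet:
            if a in goto[s]:           # (1) goto is defined
                delta[s][a] = goto[s][a]
                pending.append(goto[s][a])  # assumption: new state joins S
            else:                      # (2) goto fails
                delta[s][a] = delta[failure[s]][a]
    return delta


ALPHABET = [chr(c) for c in range(256)]  # assumption: single-byte characters
goto, failure = build_goto_and_failure(["he", "she", "his", "hers"])
delta = build_delta(goto, failure, ALPHABET)

state, visited = 0, []
for ch in "sshe":
    state = delta[state][ch]           # state <- delta(state, a_i)
    visited.append(state)
print(visited)
```

With the keyword set {he, she, his, hers}, the matching loop needs exactly one table lookup per input character, including for the second “s” of “sshe”, where the original algorithm has to consult the failure function.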
FIG. 4 shows the state transitions after the failure function is eliminated by applying the δ function. Taking state 8 as an example, FIG. 1 includes only a goto function that makes a goto transition to state 9, and a failure function that makes a failure transition to state 0. After the combination, the failure transition to state 0 is replaced by δ transitions, for example, a transition to state 1 when h is input; the input “.” indicates any character other than s and h. With such a δ function, it is not necessary to consider whether a transition comes from a failure function or a goto function; a definite transition is performed for every input according to the δ function, which improves the efficiency of the original AC algorithm.
However, in practical implementation, to ensure operating efficiency, the target states of the transitions of every state are stored in a one-dimensional array composed of 256 units. An n-byte input has 256^n possible values, but the number of valid inputs is no more than 5. Such a storage mode wastes resources.
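The scale of the waste can be illustrated with a short Python sketch for state 8 of the single-byte example. The helper dense_row is hypothetical, and the sketch assumes that every input other than s and h leads back to state 0, which follows from the δ construction for the keyword set {he, she, his, hers}.

```python
ROW_SIZE = 256  # one target-state unit per possible input byte


def dense_row(sparse, default=0):
    """Expand a sparse {byte: target state} map into a 256-unit array."""
    row = [default] * ROW_SIZE
    for byte, target in sparse.items():
        row[byte] = target
    return row


# State 8: 's' leads to state 9, 'h' leads to state 1, and any other
# input is assumed to lead back to state 0.
state8 = {ord("s"): 9, ord("h"): 1}
row = dense_row(state8)
used = sum(1 for target in row if target != 0)
print(f"{used} of {ROW_SIZE} units hold a non-default target state")
```

In this example, only 2 of the 256 units of the array carry information that distinguishes state 8 from a row filled with the default state.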