Pattern matching usually refers to searching text data for predefined keywords. It is a basic problem in computer science, and research on it is valuable in many fields, for example, information retrieval and pattern recognition. It plays an important role in applications such as spell checking, language translation, data compression, search engines, intrusion detection, content filtering, computer virus signature matching, and gene sequence comparison. For example, in some information acquisition and text editing applications, a user may designate some keywords so that the positions of the keywords in the text can be located quickly.
The Aho-Corasick algorithm is a simple and effective algorithm that locates all occurrences of a finite number of keywords in any text. The principle of the algorithm is as follows: first, a finite state pattern matching machine is constructed from the set of keywords, and then the text is fed to the pattern matching machine as input. Whenever a keyword is matched, a successful match of that keyword is reported.
For example, a user designates a keyword set {he, she, his, hers} and wants a search result to be output and reported whenever any of the keywords occurs in the searched text. The procedure of performing pattern matching with the Aho-Corasick algorithm is described below with reference to FIG. 1. A Goto function, a Failure function, and an Output function are generated from the keyword set {he, she, his, hers} in two steps: in the first step, the states and the Goto function are determined; in the second step, the Failure function is computed. Construction of the Output function starts in the first step and is completed in the second step.
As shown in FIG. 1(a), the Goto function is generated according to the keyword set (in formulas, g represents the Goto function).
The Goto function g determines the next state from the current state and an input symbol. For example, when h is input in state 0, state 0 goes to state 1, which is represented by g(0, h)=1; when h is input in state 3, state 3 goes to state 4, which is represented by g(3, h)=4. If a symbol is input in a state that has no transition for it, the Goto function fails, and the result is represented by fail; for example, when e is input in state 3, the Goto function fails, which is represented by g(3, e)=fail.
The Output function indicates the output result when matching succeeds in a given state, for example, Output(2)={he}.
To construct the Goto function, a Goto directed graph is constructed. The directed graph begins with one vertex, which represents the initial state 0. The keywords are then entered into the directed graph one by one, each as a path that begins from the initial state. New vertices and edges are added only where needed, so that, beginning from the initial state, the path spells out the complete keyword. The state at which the path terminates is added to the Output function with that keyword as its output.
It is assumed that {he, she, his, hers} is the keyword set, and a procedure of generating the finite state machine is as shown in FIG. 2.
A first keyword “he” is added to the directed graph, and a result as shown in FIG. 2a is obtained. The keyword “he” is spelt from state 0 to state 2, and the output “he” is associated with state 2, that is, Output(2)={he}.
A second keyword “she” is added to the directed graph, and FIG. 2b is obtained. The output “she” is associated with state 5, that is, Output(5)={she}.
A third keyword “his” is added to the directed graph, and a result as shown in FIG. 2c is obtained.
It should be noted that when the keyword “his” is added, an edge labeled h from state 0 to state 1 already exists, so it is unnecessary to add another edge labeled h from state 0 to state 1. The output “his” is associated with state 7, that is, Output(7)={his}.
The last keyword “hers” is added, and a result as shown in FIG. 2d is obtained. The output “hers” is associated with state 9, that is, Output(9)={hers}. In the adding procedure, the existing edge labeled h (from state 0 to 1) and the existing edge labeled e (from state 1 to 2) may be used.
To complete the construction of the Goto function, a loop needs to be added, in which the loop is from state 0 to state 0 and corresponds to all input symbols other than h or s. The obtained result is shown in FIG. 1(a).
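The first construction step can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the algorithm as published: the names `build_goto`, `g`, and `FAIL` are hypothetical, and the Goto graph is assumed to be stored as one dictionary per state. States are numbered in the order in which they are created, which for the keyword order he, she, his, hers reproduces the state numbers used in the description above.

```python
FAIL = None  # hypothetical stand-in for the "fail" result of the Goto function


def build_goto(keywords):
    """Build the Goto graph and the first-step Output function.

    goto[state] maps an input symbol to the next state.
    output maps a state to the set of keywords that end in it.
    """
    goto = [{}]   # state 0 is the initial state
    output = {}
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                # No edge yet: add a new vertex and a new labeled edge.
                goto.append({})
                goto[state][symbol] = len(goto) - 1
            # Otherwise the existing edge is reused (e.g. h from 0 to 1).
            state = goto[state][symbol]
        output.setdefault(state, set()).add(word)
    return goto, output


def g(goto, state, symbol):
    """Goto function: state 0 loops on every symbol without an edge."""
    if symbol in goto[state]:
        return goto[state][symbol]
    return 0 if state == 0 else FAIL


goto, output = build_goto(["he", "she", "his", "hers"])
print(g(goto, 0, "h"), g(goto, 3, "h"), g(goto, 3, "e"))  # 1 4 None
```

The loop on state 0 is not stored as explicit edges here; `g` simply returns 0 for any symbol that has no edge leaving state 0, which is one possible design choice among several.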
The Failure function (represented by f in formulas) indicates which state should be jumped to in order to continue the matching when the Goto function fails in a certain state. As shown in FIG. 1(b), when the symbol r is input in state 5, g(5, r)=fail, so f(5)=2 is consulted; that is, state 5 first jumps to state 2, and then g(2, r) is evaluated. Because the Failure function maps one state to another state, it is equivalent to adding a link between the two states, and the target of the Failure function for a given state is usually called the Failure link of that state.
The Failure function is constructed based on the Goto function. Firstly, the depth of state s in the goto directed graph, that is, a length of the shortest path from the initial state to s, is defined. In FIG. 1(a), the depth of the initial state is 0, the depth of states 1 and 3 is 1, and the depth of states 2, 4, and 6 is 2, and so on.
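As a small illustration of the depth definition, the depths can be obtained by a breadth-first traversal of the Goto graph. The sketch below is hypothetical: the edge table `GOTO` is transcribed by hand from the description of the keyword set {he, she, his, hers} and should be read as an assumption about FIG. 1(a), and `state_depths` is an illustrative name.

```python
from collections import deque

# Assumed Goto edges for {he, she, his, hers}, transcribed from the text.
# The loop on state 0 is omitted because it does not affect depths.
GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    6: {"s": 7},
    8: {"s": 9},
}


def state_depths(goto, start=0):
    """Return the length of the shortest path from start to each state."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in goto.get(state, {}).values():
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth


depths = state_depths(GOTO)
print(depths[1], depths[3], depths[2], depths[4], depths[6])  # 1 1 2 2 2
```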
An algorithm for computing the Failure function f is as follows. First, the Failure state of all states having the depth of 1 is computed, then the Failure state of all states having the depth of 2 is computed, and so on, until the Failure state has been computed for all states. Thus, the Failure function of the state machine is constructed.
The Failure state of all states having the depth of 1 is set to 0. Assuming that the Failure state has already been computed for all states having a depth smaller than d, the Failure state of each state s having the depth of d can be deduced from the non-fail values of the Goto function of the states having the depth of d-1.
Firstly, each state r having the depth of d-1 is considered, and the following processing is executed.
1. If g(r, a)=fail for all input symbols a, the processing of state r is terminated.
2. Otherwise, for each input symbol a for which g(r, a)=s, the following operations are executed.
(a) A state variable state=f(r) is set.
(b) The assignment state←f(state) (that is, f(state) is assigned to the state variable state) is executed zero or more times, until the value of state satisfies g(state, a)≠fail (since g(0, a)≠fail for every symbol a, an appropriate value of state will always be found).
(c) f(s)=g(state, a) is set.
For example, in order to compute the Failure function in FIG. 1(a), f(1)=f(3)=0 is firstly set, because 1 and 3 are the states having the depth of 1. Then, the Failure function of the states 2, 4 and 6 having the depth of 2 is computed.
To compute f(2), state=f(1)=0 is set; and since g(0, e)=0, f(2)=0.
To compute f(6), state=f(1)=0 is set; and since g(0, i)=0, f(6)=0.
To compute f(4), state=f(3)=0 is set; and since g(0, h)=1, f(4)=1.
By continuing the computation in this manner, the complete Failure function is obtained, as shown in FIG. 1(b).
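The depth-by-depth computation described above can be sketched as a breadth-first traversal. The following Python is a minimal sketch under stated assumptions, not a definitive implementation: `build_goto` and `build_failure` are hypothetical names, the Goto graph is assumed to be a list of per-state dictionaries, and the loop on state 0 is modeled by treating any missing edge from state 0 as a transition to state 0.

```python
from collections import deque


def build_goto(keywords):
    """Build the Goto graph; states are numbered in creation order."""
    goto = [{}]
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
    return goto


def build_failure(goto):
    """Compute f(s) for every state s other than the initial state."""
    failure = {}
    queue = deque()
    for s in goto[0].values():       # all states of depth 1
        failure[s] = 0
        queue.append(s)
    while queue:
        r = queue.popleft()          # r has depth d-1; f(r) is known
        for a, s in goto[r].items():     # every a with g(r, a) = s
            state = failure[r]           # step (a)
            # Step (b): follow Failure links until g(state, a) != fail.
            # In state 0 the Goto function never fails, so the loop stops.
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)   # step (c)
            queue.append(s)
    return failure


failure = build_failure(build_goto(["he", "she", "his", "hers"]))
print(failure[2], failure[6], failure[4], failure[5])  # 0 0 1 2
```

A queue is used so that every state of depth d-1 is handled before any state of depth d, which is what guarantees that f(r) is already available in step (a).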
During the computation of the Failure function, the Output function is updated. When it is determined that f(s)=s′ and s′ has a non-empty output set, the output set of s′ is merged into the output set of state s. For example, according to FIG. 1(a), it is determined that f(5)=2. At this point, the output set of state 2, namely {he}, is merged into the output set of state 5, namely {she}, so as to obtain a new output set {he, she}. The final Output function is as shown in FIG. 1(c).
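The update of the Output function for the case f(5)=2 can be illustrated as follows. This is a hypothetical sketch: `merge_output` is an illustrative name, the starting output sets are the ones assigned in the first step, and the sketch assumes that outputs are merged whenever the Failure target has any output.

```python
def merge_output(output, s, s_prime):
    """Merge Output(s') into Output(s) once f(s) = s' has been determined."""
    inherited = output.get(s_prime, set())
    if inherited:
        output.setdefault(s, set()).update(inherited)
    return output


# Output function after the first step (one keyword per terminal state).
output = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}

# f(5) = 2, so the output set of state 2 is merged into that of state 5.
merge_output(output, s=5, s_prime=2)
print(sorted(output[5]))  # ['he', 'she']
```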
In the following, the matching procedure of the state machine is further described with an example.
For example, a text “sshe” is input for searching. When the first s is input, g(0, s)=3, so the state goes to state 3. When a second s is input, the Goto function fails, g(3, s)=fail, so the Failure function is called, and f(3)=0 indicates that the state jumps to state 0, and then g(0, s)=3, so that the current state is still state 3. When h is input, g(3, h)=4, and the state goes to state 4. When e is input, g(4, e)=5, and the state goes to state 5. Since Output(5)={she, he}, during the search, two keywords {she, he} predefined by the user are found.
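The matching procedure for the text “sshe” can be reproduced with the following sketch. It is an illustration under assumptions rather than a reference implementation: `build_automaton` and `search` are hypothetical names, the three functions are stored as Python lists indexed by state number, and each match is reported as the pair (index of the last matched symbol, keyword).

```python
from collections import deque


def build_automaton(keywords):
    """Return (goto, failure, output) for the given keywords."""
    goto, output = [{}], [set()]
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(word)
    failure = [0] * len(goto)
    queue = deque(goto[0].values())      # states of depth 1 keep f = 0
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            state = failure[r]
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)
            output[s] |= output[failure[s]]   # merge outputs of f(s)
            queue.append(s)
    return goto, failure, output


def search(text, automaton):
    """Run the machine over text; return visited states and matches."""
    goto, failure, output = automaton
    state, trace, hits = 0, [], []
    for position, symbol in enumerate(text):
        # On failure, follow Failure links until the Goto function succeeds.
        while state != 0 and symbol not in goto[state]:
            state = failure[state]
        state = goto[state].get(symbol, 0)    # state 0 loops on other symbols
        trace.append(state)
        for word in sorted(output[state]):
            hits.append((position, word))
    return trace, hits


trace, hits = search("sshe", build_automaton(["he", "she", "his", "hers"]))
print(trace)  # [3, 3, 4, 5]
print(hits)   # [(3, 'he'), (3, 'she')]
```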
In the implementation of the present invention, the inventors find that the existing Aho-Corasick algorithm has at least the following problem: after a Failure link is followed to a certain state, the Goto function sometimes still fails in that state, and a further Failure link has to be followed to yet another state. That is, the existing Aho-Corasick algorithm processes Failure links inefficiently and contains many inefficient Failure links.
The complicated state machine shown in FIG. 3 is taken as an example, in which dashed lines represent the Failure links and q0 indicates the initial state. For example, in state q14, g(q14, e)=fail, so the machine jumps to state q27 according to f(q14)=q27. However, in state q27, g(q27, e)=fail, so it jumps to state q19 according to f(q27)=q19. Again, g(q19, e)=fail, so it jumps to state q0 according to f(q19)=q0. That is, state q0 is reached only through g(q14, e)→f(q14)→g(q27, e)→f(q27)→g(q19, e)→f(q19).
FIG. 4 shows inefficient Failure links, in which f(q6)=q0, f(q5)=q4, f(q4)=q3, f(q3)=q2, f(q2)=q1, and f(q1)=q0. If a symbol c is input in state q4, g(q4, c)=fail, so the machine jumps to state q3 according to f(q4)=q3. However, g(q3, c)=fail, so it jumps to state q2 according to f(q3)=q2. However, g(q2, c)=fail, so it jumps to state q1 according to f(q2)=q1. However, g(q1, c)=fail, so it jumps to the initial state q0 according to f(q1)=q0. Only then does the jumping along the Failure links end; that is, state q0 is reached only through g(q4, c)→f(q4)→g(q3, c)→f(q3)→g(q2, c)→f(q2)→g(q1, c)→f(q1).
It can be seen that the Aho-Corasick algorithm processes Failure links inefficiently and contains many inefficient Failure links.
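The cost of such a chain can be made concrete by counting how many Failure links are followed for a single input symbol. The sketch below is hypothetical: FIG. 4 is not reproduced here, so the single keyword "aaaaab" is an assumption, chosen only because the machine built from it has the same Failure links as those listed above (f(q6)=q0, f(q5)=q4, ..., f(q1)=q0); `build` and `failure_hops` are illustrative names.

```python
from collections import deque


def build(keywords):
    """Return (goto, failure) for the given keywords."""
    goto = [{}]
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
    failure = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            state = failure[r]
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)
            queue.append(s)
    return goto, failure


def failure_hops(goto, failure, state, symbol):
    """Count the Failure links followed before the Goto function succeeds."""
    hops = 0
    while state != 0 and symbol not in goto[state]:
        state = failure[state]
        hops += 1
    return hops, state


goto, failure = build(["aaaaab"])    # assumed keyword, see the note above
print(failure[1:])                   # [0, 1, 2, 3, 4, 0]
hops, reached = failure_hops(goto, failure, 4, "c")
print(hops, reached)                 # 4 0
```

In this assumed machine, a single mismatching symbol in state q4 costs four Failure jumps before the initial state is reached, which is the kind of overhead described above.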