In many information retrieval and text-editing applications it is necessary to be able to locate quickly some or all occurrences of user-specified patterns of words and phrases in text. The paper entitled “Efficient String Matching: An Aid to Bibliographic Search” by Alfred V. Aho and Margaret J. Corasick of Bell Laboratories describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm consists of two parts. In the first part we construct from the set of keywords a finite state pattern matching machine; in the second part we apply the text string as input to the pattern matching machine. The machine signals whenever it has found a match for a keyword.
The prior-art Aho-Corasick methodology will now be described as background.
A string is simply a finite sequence of characters. Let K={y1, y2, . . . , yk} be a finite set of strings which we shall call keywords and let x be an arbitrary string which we shall call the text string. Our problem is to locate and identify all substrings of x which are keywords in K. Substrings may overlap with one another. A pattern matching machine for K is a program which takes as input the text string x and produces as output the locations in x at which keywords of K appear as substrings. The pattern matching machine consists of a set of states or nodes. Each state is represented by a number. The machine processes the text string x by successively reading the characters in x, making state transitions and occasionally emitting output. The behaviour of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output. FIG. 1 shows the functions used by a pattern matching machine for the set of keywords {he, she, his, hers}.
In the prior art technique, one state (usually 0) is designated as a start or root node. In the FIG. 1 example, the nodes are 0, 1, . . . , 9. The goto function g maps a pair consisting of a state and an input character into a node or the message fail. The directed graph in FIG. 1(a) represents the goto function. For example, the edge labeled h from 0 to 1 indicates that g(0,h)=1. The absence of an arrow indicates fail. Thus, g(1,σ)=fail for all input characters σ that are not e or i. All our pattern matching machines have the property that g(0,σ)≠fail for all input characters σ. We shall see that this property of the goto function on state 0 ensures that one input character will be processed by the machine every machine cycle.
The failure function f maps a node into a node. The failure function is consulted whenever the goto function reports fail. Certain nodes are designated as output nodes which indicate that a set of keywords has been found. The output function formalizes this concept by associating a set of keywords (possibly empty) with every node.
An operating cycle of a pattern matching machine is defined as follows. Let s be the current node of the machine and a the current character of the input string x.
1. If g (s,a)=s′, the machine makes a goto transition. It enters state s′, and the next character of x becomes the current input character. In addition, if output (s′)≠empty, then the machine emits the set output (s′) along with the position of the current input character. The operating cycle is now complete.
2. If g (s,a)=fail, the machine consults the failure function f and is said to make a failure transition. If f(s)=s′, the machine repeats the cycle with s′ as the current node and a as the current input.
Initially, the current state of the machine is the start state and the first character of the text string is the current input character. The machine then processes the text string by making one operating cycle on each character of the text string. For example, consider the behaviour of the machine M that uses the functions in FIG. 1 to process the text string “ushers.” Table 1 indicates the state transitions made by M in processing the text string.
TABLE 1
Sequence of node transitions.
     u   s   h   e   r   s
 0   0   3   4   5   8   9
Consider the operating cycle when M is in state 4 and the current input character is e. Since g(4,e)=5, the machine enters state 5, advances to the next input character and emits output (5), indicating that it has found the keywords “she” and “he” at the end of position four in the text string. In state 5 on input character r, the machine makes two node transitions in its operating cycle. Since g(5,r)=fail, M enters node 2=f(5). Then since g(2,r)=8, M enters node 8 and advances to the next input character. No output is generated in this operating cycle.
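The operating cycle and the “ushers” example above can be sketched as follows. This is a minimal illustration only, not the implementation of the paper: the names GOTO, FAILURE, OUTPUT, g and run are hypothetical, the goto edges are those built up for the keywords {he, she, his, hers}, and the failure values not stated explicitly in the text (for nodes 7, 8 and 9) are assumed from applying the failure function procedure described in this background.

```python
# Goto function for {he, she, his, hers}; a missing entry means "fail".
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
# Failure function (undefined for the start state 0).
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
# Nonempty output sets.
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Return the goto target, or None for fail; state 0 never fails."""
    target = GOTO.get((state, ch))
    if target is None and state == 0:
        return 0  # loop on state 0 for every other character
    return target


def run(text):
    """Process text; return the node sequence and the emitted matches."""
    state = 0
    trace = [state]
    matches = []
    for pos, ch in enumerate(text, start=1):
        # Failure transitions: same input character, new current node.
        while g(state, ch) is None:
            state = FAILURE[state]
        # Goto transition: consume the character.
        state = g(state, ch)
        trace.append(state)
        if state in OUTPUT:
            matches.append((pos, sorted(OUTPUT[state])))
    return trace, matches


trace, matches = run("ushers")
print(trace)    # [0, 0, 3, 4, 5, 8, 9]
print(matches)  # [(4, ['he', 'she']), (6, ['hers'])]
```

In this sketch the intermediate node 2, entered by the failure transition on input character r, is not recorded in the trace, since only the node reached at the end of each operating cycle is kept.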
We say that the three functions g, f, and output are valid for a set of keywords if, with these functions, the pattern matching machine described above (Algorithm 1 of the paper) indicates that keyword y ends at position i of text string x if and only if x=uyv and the length of uy is i.
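The validity condition can be read as a specification against which any pattern matching machine may be checked. A minimal sketch of that specification, assuming a hypothetical helper named naive_matches that simply tests every keyword at every end position, might be:

```python
def naive_matches(keywords, x):
    """Return all (i, y) such that x = u y v and the length of u y is i."""
    found = set()
    for y in keywords:
        # i is the 1-based position in x at which keyword y ends.
        for i in range(len(y), len(x) + 1):
            if x[i - len(y):i] == y:
                found.add((i, y))
    return found


print(sorted(naive_matches(["he", "she", "his", "hers"], "ushers")))
# [(4, 'he'), (4, 'she'), (6, 'hers')]
```

This brute-force check is far slower than the machine itself and is offered only as a reference for testing; it is not part of the Aho-Corasick methodology.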
We shall now show how to construct valid goto, failure and output functions from a set of keywords. There are two parts to the construction. In the first part we determine the states and the “goto” function. In the second part we compute the failure function. The computation of the output function is begun in the first part of the construction and completed in the second part.
To construct the “goto” function, we shall construct a goto graph. We begin with a graph consisting of one vertex which represents the state 0. We then enter each keyword y into the graph, by adding a directed path to the graph that begins at the start state. New vertices and edges are added to the graph so that there will be, starting at the start state, a path in the graph that spells out the keyword y. The keyword y is added to the output function of the state at which the path terminates. We add new edges to the graph only when necessary.
For example, suppose {he, she, his, hers} is the set of keywords. Adding the first keyword to the graph, we obtain the trie of FIG. 2a. The path from state 0 to state 2 spells out the keyword “he”; we associate the output “he” with state 2. Adding the second keyword “she,” we obtain FIG. 2b. The output “she” is associated with state 5. Adding the keyword “his,” we obtain FIG. 2c. Notice that when we add the keyword “his” there is already an edge labeled h from state 0 to state 1, so we do not need to add another edge labeled h from state 0 to state 1. The output “his” is associated with state 7. Adding the last keyword “hers,” we obtain FIG. 2d. The output “hers” is associated with state 9. Here we have been able to use the existing edge labeled h from state 0 to 1 and the existing edge labeled e from state 1 to 2. Up to this point the graph is a rooted directed tree. To complete the construction of the goto function we add a loop from state 0 to state 0 on all input characters other than h or s. We obtain the directed graph shown in FIG. 1(a). This graph represents the goto function.
The failure function is constructed from the goto function. Let us define the depth of a state s in the goto graph as the length of the shortest path from the start state to s. Thus in FIG. 1(a), the start state is of depth 0, states 1 and 3 are of depth 1, states 2, 4, and 6 are of depth 2, and so on. We shall compute the failure function for all states of depth 1, then for all states of depth 2, and so on, until the failure function has been computed for all states (except state 0 for which the failure function is not defined). The algorithm to compute the failure function f at a state is conceptually quite simple. We make f(s)=0 for all states s of depth 1. Now suppose f has been computed for all states of depth less than d. The failure function for the states of depth d is computed from the failure function for the states of depth less than d. The states of depth d can be determined from the nonfail values of the goto function of the states of depth d−1.
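The construction of the goto graph just described can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code of the paper: the names build_goto and g are hypothetical, the goto function is held in a dictionary keyed by (state, character), and the loop on state 0 is represented implicitly by the helper g instead of by explicit edges.

```python
def build_goto(keywords):
    """Enter each keyword into the goto graph, adding edges only when needed."""
    goto = {}    # (state, character) -> state; a missing entry means fail
    output = {}  # state -> set of keywords whose path terminates there
    newest = 0   # highest state number allocated so far
    for y in keywords:
        state = 0
        for ch in y:
            if (state, ch) not in goto:
                newest += 1
                goto[(state, ch)] = newest  # new vertex and edge
            state = goto[(state, ch)]       # follow the (possibly existing) edge
        output.setdefault(state, set()).add(y)
    return goto, output


def g(goto, state, ch):
    """Goto function with the loop on state 0; None stands for fail."""
    target = goto.get((state, ch))
    if target is None and state == 0:
        return 0
    return target


goto, output = build_goto(["he", "she", "his", "hers"])
print(goto[(0, "h")], goto[(1, "e")], goto[(2, "r")], goto[(8, "s")])  # 1 2 8 9
print(output)  # {2: {'he'}, 5: {'she'}, 7: {'his'}, 9: {'hers'}}
```

Entering the keywords in the order he, she, his, hers reproduces the state numbering used in the example, because states are numbered in the order in which they are created.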
Specifically, to compute the failure function for the nodes of depth d, we consider each node r of depth d−1 and perform the following actions.
1. If g(r,a)=fail for all a, do nothing.
2. Otherwise, for each character a such that g(r,a)=s, do the following:
    (a) Set node=f(r).
    (b) Execute the statement node←f(node) zero or more times, until a value for node is obtained such that g(node,a)≠fail. (Note that since g(0,a)≠fail for all a, such a node will always be found.)
    (c) Set f(s)=g(node,a).
For example, to compute the failure function from FIG. 1(a), we would first set f(1)=f(3)=0 since 1 and 3 are the nodes of depth 1. We then compute the failure function for 2, 6 and 4, the nodes of depth 2. To compute f(2), we set node=f(1)=0; and since g(0,e)=0, we find that f(2)=0. To compute f(6), we set node=f(1)=0; and since g(0,i)=0, we find that f(6)=0. To compute f(4), we set node=f(3)=0; and since g(0,h)=1, we find that f(4)=1. Continuing in this fashion, we obtain the failure function shown in FIG. 1(b).
During the computation of the failure function we also update the output function. When we determine f(s)=s′, we merge the outputs of node s with the output of node s′. For example, from FIG. 1(a) we determine f(5)=2. At this point we merge the output set of state 2, namely {he}, with the output set of node 5 to derive the new output set {he, she}. The final nonempty output sets are shown in FIG. 1(c).
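The depth-by-depth computation of the failure function, together with the merging of output sets, can be sketched as follows. The sketch assumes the goto edges and initial output sets of the example for {he, she, his, hers}; the names GOTO, OUTPUT, g and build_failure are hypothetical, and a first-in-first-out queue is used as one possible way of visiting nodes in order of increasing depth.

```python
from collections import deque

# Goto edges and initial outputs for {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
OUTPUT = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}


def g(goto, state, ch):
    """Goto function with the loop on state 0; None stands for fail."""
    target = goto.get((state, ch))
    return 0 if target is None and state == 0 else target


def build_failure(goto, output):
    """Compute f for depth 1, then depth 2, and so on, merging outputs."""
    children = {}  # r -> list of (a, s) with g(r, a) = s
    for (r, a), s in goto.items():
        children.setdefault(r, []).append((a, s))
    failure = {}
    merged = {s: set(kws) for s, kws in output.items()}  # work on a copy
    queue = deque()
    for _, s in children.get(0, []):
        failure[s] = 0  # f(s) = 0 for all states of depth 1
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in children.get(r, []):
            node = failure[r]                  # step (a)
            while g(goto, node, a) is None:    # step (b)
                node = failure[node]
            failure[s] = g(goto, node, a)      # step (c)
            if failure[s] in merged:           # merge output(f(s)) into output(s)
                merged.setdefault(s, set()).update(merged[failure[s]])
            queue.append(s)
    return failure, merged


failure, merged = build_failure(GOTO, OUTPUT)
print(failure[4], failure[5], failure[7], failure[9])  # 1 2 3 3
print(sorted(merged[5]))                               # ['he', 'she']
```

Because every node of depth less than d is taken from the queue before any node of depth d, the output set of f(s) is already complete at the moment it is merged into the output set of s.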
So far we have discussed only the case where there is a single failure link going from a particular node. In a refined version of the Aho-Corasick methodology, also discussed in the paper, when there is a failure at a particular node there may be multiple failure links, depending on the character under consideration. This is best described with reference to figure X, which shows a table of the failure links for the same example above. The next move function is encoded in FIG. 3 as follows. In node 0, for example, we have a transition on h to state 1, a transition on s to node 3, and a transition on any other character to node 0. In each node, the dot stands for any other character. This refined methodology is referred to hereinafter as extended link methodology, and the previously described methodology as normal failure link methodology. The invention described hereinafter is applicable to both.
One drawback of the known Aho-Corasick methodology, described above, lies in the need to recompile the structure if an update is made. This takes a considerable amount of processing power, especially as the structure of the known Aho-Corasick methodology has to be built up “breadth first”, i.e. a depth at a time for each string.
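One way of deriving such a next move table from the goto and failure functions can be sketched as follows. This is an assumption-laden illustration and not the encoding of the figures: the names GOTO, FAILURE, ALPHABET, build_next_move and step are hypothetical, each row stores only the characters that lead to a node other than 0, and the default value 0 plays the role of the dot entry for any other character.

```python
# Goto edges and failure function for {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
ALPHABET = sorted({ch for (_, ch) in GOTO})  # characters used by keywords


def g(state, ch):
    """Goto function with the loop on state 0; None stands for fail."""
    target = GOTO.get((state, ch))
    return 0 if target is None and state == 0 else target


def build_next_move(num_states=10):
    """Precompute, per node and character, the node finally reached."""
    table = {}
    for state in range(num_states):
        row = {}
        for ch in ALPHABET:
            node = state
            while g(node, ch) is None:  # follow failure links ahead of time
                node = FAILURE[node]
            target = g(node, ch)
            if target != 0:             # 0 is the default ("dot") entry
                row[ch] = target
        table[state] = row
    return table


def step(table, state, ch):
    """Exactly one transition per input character, with no failure links."""
    return table[state].get(ch, 0)


table = build_next_move()
print(table[0])  # {'h': 1, 's': 3}
print(table[1])  # {'e': 2, 'h': 1, 'i': 6, 's': 3}
```

With such a table the machine makes a single lookup per input character, at the cost of storing more transitions per node than the goto function alone.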