Advances in network and storage-subsystem design continue to push the rate at which data streams must be processed between and within computer systems. Meanwhile, the content of such data streams is subjected to ever-increasing scrutiny, as components at all levels mine the streams for patterns that can trigger time-sensitive action. Patterns can include not only constant strings (e.g., “dog” and “cat”) but also specifications that denote credit card numbers, currency values, or telephone numbers, to name a few. A widely used pattern specification language is the regular expression language. Regular expressions and their implementation via deterministic finite automata (DFAs) are a well-developed field. See Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, Addison Wesley, 1979, the entire disclosure of which is incorporated herein by reference. A DFA is a logical representation that defines the operation of a state machine, as explained below. However, the inventors herein believe that a need in the art exists for improving the use of regular expressions in connection with high performance pattern matching.
For some applications, such as packet header filtering, the location of a given pattern may be anchored; that is, a match occurs only if the pattern begins or ends at one of a set of prescribed locations within the data stream. More commonly, a pattern can begin or end anywhere within the data stream (e.g., unstructured data streams, packet payloads, etc.). Some applications require the concurrent imposition of thousands of patterns at every byte of a data stream. Examples of such applications include but are not limited to:
    network intrusion detection/prevention systems, which typically operate using a rule base of nearly 10,000 patterns (see Roesch, M., “Snort—lightweight intrusion detection for networks”, LISA '99: 13th Systems Administration Conference, pp. 229-238, 1999, the entire disclosure of which is incorporated herein by reference);
    email monitoring systems, which scan outgoing email for inappropriate or illegal content;
    spam filters, which impose user-specific patterns to filter incoming email;
    virus scanners, which filter for signatures of programs known to be harmful; and
    copyright enforcement programs, which scan media files or socket streams for pirated content.
In applications such as these, the set of patterns sought within the data streams can change daily.
Today's conventional high-end workstations cannot keep pace with pattern matching applications given the speed of data streams originating from high speed networks and storage subsystems. To address this performance gap, the inventors herein turn to architectural innovation in the formulation and realization of DFAs in pipelined architectures (e.g., hardware logic, networked processors, or other pipelined processing systems).
A regular expression r denotes a regular language L(r), where a language is a (possibly infinite) set of (finite) strings. Each string is composed of symbols drawn from an alphabet Σ. The syntax of a regular expression is defined inductively, with the following basic expressions:
    any symbol α∈Σ denotes {α};
    the symbol λ denotes the singleton set containing an empty (zero-width) string; and
    the symbol φ denotes the empty set.
Each of the foregoing is a regular language. Regular expressions of greater complexity can be constructed using the union, concatenation, and Kleene-closure operators, as is well known in the art. Symbol-range specifiers and clause repetition factors are typically offered for syntactic convenience. While any of the well-known regular expression notations and extensions are suitable for use in the practice of the present invention, the description herein and the preferred embodiment of the present invention support the Perl notation and extensions for regular expressions due to Perl's popularity.
As noted above, regular expressions find practical use in a plethora of searching applications including but not limited to file searching and network intrusion detection systems. Most text editors and search utilities specify search targets using some form of regular expression syntax. As an illustrative example, using Perl syntax, the pattern shown in FIG. 1 is intended to match strings that denote US currency values. In this syntax:
    a backslash “\” precedes any special symbol that is to be taken literally;
    the low- and high-value of a character range is specified using a dash “-”;
    the “+” sign indicates that the preceding expression can be repeated one or more times;
    a single number in braces indicates that the preceding expression can be repeated exactly the designated number of times; and
    a pair of numbers in braces indicates a range of repetitions for the preceding expression.
Thus, strings that match the above expression begin with the symbol “$”, followed by some positive number of decimal digits; that string may optionally be followed by a decimal point “.” and exactly two more decimal digits. In practice, a pattern for such matches may also specify that the match be surrounded by some delimiter (such as white space) so that the string “$347.12” yields one match instead of four matches (i.e., “$3”, “$34”, “$347”, “$347.12”).
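FIG. 1 is not reproduced in this text, so the following Python sketch assumes a pattern reconstructed from the description above, namely `\$[0-9]+(\.[0-9]{2}){0,1}`. The delimited variant and the function names are illustrative assumptions rather than part of the original disclosure. The sketch shows why the undelimited pattern yields four matches on “$347.12” while a delimited pattern yields one.

```python
import re

# Assumed reconstruction of the FIG. 1 pattern (the figure is not shown here).
CURRENCY = re.compile(r"\$[0-9]+(\.[0-9]{2}){0,1}")

# Hypothetical delimited variant: the match must be bounded by white space
# or by the ends of the target string.
DELIMITED = re.compile(r"(?<!\S)\$[0-9]+(\.[0-9]{2}){0,1}(?!\S)")


def all_matches(target):
    """Return every substring of target that belongs to the pattern's language."""
    found = []
    for start in range(len(target)):
        for end in range(start + 1, len(target) + 1):
            # fullmatch with pos/endpos tests exactly target[start:end].
            if CURRENCY.fullmatch(target, start, end):
                found.append(target[start:end])
    return found


def delimited_matches(target):
    """Return only the matches that are surrounded by delimiters."""
    return [m.group(0) for m in DELIMITED.finditer(target)]


print(all_matches("$347.12"))                # ['$3', '$34', '$347', '$347.12']
print(delimited_matches("pay $347.12 now"))  # ['$347.12']
```

The exhaustive substring test is quadratic in the length of the target and is meant only to make the definition concrete; it is not a practical matcher.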
Applications that use regular expressions to specify patterns of interest typically operate as follows: Given a regular expression r and a target string t (typically the contents of some input stream such as a file), find all substrings of t in L(r). The substrings are typically reported by their position within t. Thus, unless otherwise stated, it is generally intended that the pattern r is applied at every position in the target and that all matches are reported.
The simplest and most practical mechanism for recognizing patterns specified using regular expressions is the DFA, which is formally described as the 5-tuple:
    (Q, Σ, q0, δ, A)
where:
    Q is a finite set of states;
    Σ is an alphabet of input symbols;
    q0 ∈ Q is the DFA's initial state;
    δ is a transition function: Q × Σ → Q; and
    A ⊆ Q is a set of accepting states.
A DFA operates as follows. It begins in state q0. If the DFA is in state q, then the next input symbol a causes a transition determined by δ(q, a). If the DFA effects a transition to a state q ∈ A, then the string processed up to that point is accepted and is in the language recognized by the DFA. As an illustrative example, the regular expression of FIG. 1 can be translated into the canonical DFA shown in FIG. 2 using a sequence of well-known steps, including a step that makes the starting position for a match arbitrary (unanchored) (see the Hopcroft and Ullman reference cited above). For convenience, the DFA of FIG. 2 uses the term “[0-9]” to denote the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and uses the symbol “˜” to denote all symbols of Σ not in the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, $, .}.
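The operation just described can be sketched in Python as follows. This is a minimal illustration, not the DFA of FIG. 2 (which is not reproduced here): the state names q0 through q5, the partial transition function, and the helper names are assumptions chosen to recognize a “$”, one or more digits, and an optional two-digit fraction, with matching anchored at the first symbol.

```python
DIGITS = "0123456789"


def build_delta():
    """Partial transition function for a hypothetical anchored currency DFA."""
    delta = {("q0", "$"): "q1", ("q2", "."): "q3"}
    for d in DIGITS:
        delta[("q1", d)] = "q2"  # first digit after "$"
        delta[("q2", d)] = "q2"  # further whole-number digits
        delta[("q3", d)] = "q4"  # first fractional digit
        delta[("q4", d)] = "q5"  # second fractional digit
    return delta


DELTA = build_delta()
START = "q0"
ACCEPTING = {"q2", "q5"}


def accepted_prefix_lengths(text):
    """Run the DFA from START and return the lengths of all accepted prefixes."""
    state = START
    lengths = []
    for length, symbol in enumerate(text, start=1):
        state = DELTA.get((state, symbol))
        if state is None:  # no transition defined: the DFA blocks
            break
        if state in ACCEPTING:
            lengths.append(length)
    return lengths


print(accepted_prefix_lengths("$347.12"))  # [2, 3, 4, 7]
print(accepted_prefix_lengths("x$347"))    # []
```

The second call returns nothing because this anchored, partial DFA blocks on the first symbol it has no transition for.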
The construction of a DFA typically involves an intermediate step in which a nondeterministic finite automaton (NFA) is constructed. An NFA differs from a DFA in that whereas a DFA is a finite state machine that allows at most one transition for each input symbol and state, an NFA is a finite state machine that allows for more than one transition for each input symbol and state. Also, every regular language has a canonical DFA that is obtained by minimizing the number of states needed to recognize that language. Unless specified otherwise herein, it should be assumed that all automata are in canonical (deterministic) form.
However, for the purpose of pattern matching, the inventors herein believe that the DFA shown in FIG. 2 is deficient in the following respects:
    Symbols not in the alphabet of the regular expression will cause the DFA to block. For pattern matching, such symbols should be ignored by a DFA so that it can continue to search for matches. This deficiency can be overcome by completing the DFA as follows:
        The alphabet is widened to include any symbol that might occur in the target string. In this description, it is assumed that Σ is the 8-bit (extended) ASCII character set comprising 256 symbols.
        The DFA is augmented with a new state U: Q ← Q ∪ {U}.
        The transition function δ is completed by defining δ(q, a) = U for all q ∈ Q, a ∈ Σ for which δ was previously undefined.
    A match will be found only if it originates at the first character of the target string. Pattern-matching applications are concerned with finding all occurrences of the denoted pattern at any position in the target. This deficiency can be overcome by allowing the DFA to restart at every position. Formally, a λ transition is inserted from every q ∈ Q to q0.
The result of the above augmentation is an NFA that can be transformed into a canonical DFA through known techniques. FIG. 3 provides an illustrative example of such a canonical DFA.
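The two augmentations above can be sketched in Python by simulating the resulting NFA directly, tracking the set of active states instead of first converting to a canonical DFA. The state names, the small currency automaton, and the function names are assumptions made for illustration; they are not taken from FIG. 2 or FIG. 3.

```python
DIGITS = "0123456789"
START = "q0"
ACCEPTING = {"q2", "q5"}
DEAD = "U"  # the added state U that completes the transition function


def build_delta():
    """Partial transition function for a hypothetical currency automaton."""
    delta = {("q0", "$"): "q1", ("q2", "."): "q3"}
    for d in DIGITS:
        delta[("q1", d)] = "q2"
        delta[("q2", d)] = "q2"
        delta[("q3", d)] = "q4"
        delta[("q4", d)] = "q5"
    return delta


DELTA = build_delta()


def step(state, symbol):
    """Completed transition function: undefined entries lead to U."""
    return DELTA.get((state, symbol), DEAD)


def match_end_positions(text):
    """Return the 1-based positions at which some match ends."""
    active = {START}
    ends = []
    for pos, symbol in enumerate(text, start=1):
        # The lambda transition from every state back to q0 means q0 is
        # always active, so a match may begin at any position.
        active = {step(q, symbol) for q in active} | {START}
        if active & ACCEPTING:
            ends.append(pos)
    return ends


print(match_end_positions("pay $347.12 now"))  # [6, 7, 8, 11]
```

Converting the same NFA to a DFA by the subset construction would yield one DFA state per distinct active set reached here.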
A DFA is typically implemented interpretively by realizing its transitions δ as a table: each row corresponds to a state of the DFA and each column corresponds to an input symbol. The transition table for the DFA of FIG. 3 is shown in FIG. 4. If the alphabet Σ for the DFA is the 8-bit ASCII character set (as is often the case), then the transition table of FIG. 4 would have 256 columns. Each entry in the transition table of FIG. 4 comprises a next state identifier. The transition table of FIG. 4 works as follows: if the DFA's current state is B and the next input symbol is 2, then the transition table calls for a transition to state D, because “D” is the next state identifier that is indexed by current state B and input symbol 2. In the description herein, states are labeled by letters to avoid confusion with symbol encodings. In practice, however, states are typically represented by an integer index in the transition table.
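A table-driven interpreter of this kind can be sketched in Python as follows. The table is not that of FIG. 4, which is not reproduced here; it is a hypothetical six-state unanchored DFA for a currency pattern, with integer state indices S0 through S5 chosen for illustration.

```python
NUM_SYMBOLS = 256
S0, S1, S2, S3, S4, S5 = range(6)  # hypothetical integer state indices
ACCEPTING = {S2, S5}


def build_table():
    """Row = current state, column = input byte, entry = next state."""
    table = [[S0] * NUM_SYMBOLS for _ in range(6)]
    for row in table:
        row[ord("$")] = S1          # "$" always begins a new candidate match
    for digit in b"0123456789":
        table[S1][digit] = S2       # "$" followed by a first digit
        table[S2][digit] = S2       # further whole-number digits
        table[S3][digit] = S4       # first fractional digit
        table[S4][digit] = S5       # second fractional digit
    table[S2][ord(".")] = S3
    return table


TABLE = build_table()


def scan(data):
    """Return the 1-based positions at which the DFA enters an accepting state."""
    state = S0
    ends = []
    for pos, byte in enumerate(data, start=1):
        state = TABLE[state][byte]  # one table lookup per input symbol
        if state in ACCEPTING:
            ends.append(pos)
    return ends


print(scan(b"pay $347.12 now"))  # [6, 7, 8, 11]
```

Each lookup depends on the state produced by the previous lookup, which is the feedback dependency that limits how deeply such an interpreter can be pipelined.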
The inventors herein believe that the pattern matching techniques for implementing DFAs in a pipelined architecture can be greatly improved via the novel pattern matching architecture disclosed herein. According to one aspect of the present invention, a pipelining strategy is disclosed that defers all state-dependent (iterative, feedback dependent) operations to the final stage of the pipeline. Preferably, transition table lookups operate to retrieve all transition table entries that correspond to the input symbol(s) being processed by the DFA. Retrievals of transition entries from a transition table memory will not be based on the current state of the DFA. Instead, retrievals from the transition table memory will operate to retrieve a set of stored transition entries based on data corresponding to the input symbol(s) being processed.
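The idea of deferring the state-dependent step can be sketched in Python with two generator stages. This is an illustrative software analogy under stated assumptions, not the disclosed hardware: the tiny three-state DFA (an unanchored match of the string “ab”), the column-major layout, and the stage names are all hypothetical.

```python
ACCEPT = 2  # state reached after reading "ab"


def build_columns():
    """COLUMNS[symbol][state] is the next state; one column per input byte."""
    columns = []
    for symbol in range(256):
        if symbol == ord("a"):
            column = (1, 1, 1)   # "a" always starts a candidate match
        elif symbol == ord("b"):
            column = (0, 2, 0)   # "b" completes a match only from state 1
        else:
            column = (0, 0, 0)
        columns.append(column)
    return columns


COLUMNS = build_columns()


def fetch_stage(data):
    """State-independent stage: the lookup depends only on the input symbol."""
    for symbol in data:
        yield COLUMNS[symbol]


def select_stage(columns):
    """Final stage: the only place where the current state is consulted."""
    state = 0
    for pos, column in enumerate(columns, start=1):
        state = column[state]
        if state == ACCEPT:
            yield pos


print(list(select_stage(fetch_stage(b"xabab"))))  # [3, 5]
```

Because `fetch_stage` never reads the current state, any number of such stages could run ahead of the selection step without waiting on feedback.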
In a preferred embodiment where alphabet encoding is used to map the input symbols of the input data stream to equivalence class identifiers (ECIs), these transition table entries are indirectly indexed to one or more input symbols by data corresponding to ECIs. This improvement allows for the performance of single-cycle state transition decisions, enables the use of more complex compression and encoding techniques, and increases the throughput and scalability of the architecture.
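One common way to derive such an alphabet encoding, assumed here for illustration, is to give the same ECI to every input symbol whose transition-table column is identical. The Python sketch below applies that idea to a hypothetical six-state currency DFA; the table contents and function names are not taken from the original disclosure.

```python
NUM_SYMBOLS = 256
NUM_STATES = 6


def build_table():
    """Hypothetical table: row = state, column = input byte, entry = next state."""
    table = [[0] * NUM_SYMBOLS for _ in range(NUM_STATES)]
    for row in table:
        row[ord("$")] = 1
    for digit in b"0123456789":
        table[1][digit] = 2
        table[2][digit] = 2
        table[3][digit] = 4
        table[4][digit] = 5
    table[2][ord(".")] = 3
    return table


def encode_alphabet(table):
    """Group symbols with identical columns into equivalence classes."""
    eci_of_column = {}
    symbol_to_eci = []
    for symbol in range(len(table[0])):
        column = tuple(row[symbol] for row in table)
        # A column seen for the first time receives the next unused ECI.
        eci = eci_of_column.setdefault(column, len(eci_of_column))
        symbol_to_eci.append(eci)
    eci_columns = sorted(eci_of_column, key=eci_of_column.get)
    return symbol_to_eci, eci_columns


TABLE = build_table()
SYMBOL_TO_ECI, ECI_COLUMNS = encode_alphabet(TABLE)
print(len(ECI_COLUMNS))  # 4 classes: "$", ".", digits, everything else
```

In this example the 256-column table collapses to four columns, which is what makes it practical to fetch every entry for an input symbol at once.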
According to another aspect of the present invention, the transitions of the transition table preferably include a match flag that indicates whether a match of an input symbol string to the pattern has occurred upon receipt of the input symbol(s) that caused the transition. Similarly, the transitions of the transition table preferably include a match restart flag that indicates whether the matching process has restarted upon receipt of the input symbol(s) that caused the transition. The presence of a match flag in each transition allows for the number of states in the DFA to be reduced relative to traditional DFAs because the accepting states can be eliminated and rolled into the match flags of the transitions. The presence of a match restart flag allows the DFA to identify the substring of the input stream that matches an unanchored pattern. Together, the presence of these flags in the transitions contributes to another aspect of the present invention, wherein the preferred DFA is configured with an ability to scale upward in the number of bytes processed per cycle. State transitions can be triggered by a sequence of m input symbols, wherein m is greater than or equal to 1 (rather than being limited to processing only a single input symbol per clock cycle). Because of the manner by which the transitions include match flags and match restart flags, as disclosed herein, the DFA will still be able to detect when and where matches occur in the input stream as a result of the leading or an intermediate input symbol of the sequence of m input symbols that are processed together by the DFA as a group.
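The single-symbol case (m = 1) can be sketched in Python as follows. This is an interpretation under stated assumptions rather than the disclosed design: each transition carries a next state, a match flag, and a match restart flag; the restart flag is assumed to be set on the transition taken when “$” arrives; and the five state names are hypothetical. With the flags on the transitions, the state reached after the second fractional digit can be merged with the idle state.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    next_state: int
    match: bool    # a match ends on the symbol that caused this transition
    restart: bool  # a new candidate match starts on that symbol


IDLE, DOLLAR, WHOLE, POINT, CENT1 = range(5)  # hypothetical states


def build_table():
    default = Transition(IDLE, False, False)
    table = [[default] * 256 for _ in range(5)]
    for row in table:
        row[ord("$")] = Transition(DOLLAR, False, True)
    for digit in b"0123456789":
        table[DOLLAR][digit] = Transition(WHOLE, True, False)
        table[WHOLE][digit] = Transition(WHOLE, True, False)
        table[POINT][digit] = Transition(CENT1, False, False)
        # The match is signalled by the flag, so no accepting state is needed.
        table[CENT1][digit] = Transition(IDLE, True, False)
    table[WHOLE][ord(".")] = Transition(POINT, False, False)
    return table


TABLE = build_table()


def scan(data):
    """Return the matched substrings, located with the two flags."""
    state, start, found = IDLE, 0, []
    for pos, byte in enumerate(data):
        transition = TABLE[state][byte]
        if transition.restart:
            start = pos
        if transition.match:
            found.append(data[start:pos + 1])
        state = transition.next_state
    return found


print(scan(b"pay $347.12 now"))  # [b'$3', b'$34', b'$347', b'$347.12']
```

The multi-symbol case, in which one transition consumes m symbols, is not shown here.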
According to yet another aspect of the present invention, incremental scaling, compression and character-encoding techniques are used to substantially reduce the resources required to realize a high throughput DFA. For example, run-length coding can be used to reduce the amount of memory consumed by (i.e., compress) the DFA's transition table. Furthermore, the state selection logic can then operate on the run-length coded transitions to determine the next state for the DFA. Masking can be used in the state selection logic to remove from consideration portions of the transition table memory words that do not contain transitions that correspond to the ECI of the input symbol(s) being processed.
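Run-length coding of one transition-table column, and next-state selection directly on the coded form, can be sketched in Python as below. The (run length, next state) pair format and the function names are assumptions; the encoding actually used in the preferred embodiment may differ, and the masking step is not modeled.

```python
def run_length_encode(column):
    """Compress a column of next-state identifiers into (length, state) runs."""
    runs = []
    for next_state in column:
        if runs and runs[-1][1] == next_state:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, next_state])
    return [tuple(run) for run in runs]


def select_next_state(runs, current_state):
    """Find the run that covers current_state without decompressing."""
    covered = 0
    for length, next_state in runs:
        covered += length
        if current_state < covered:
            return next_state
    raise IndexError("current_state is beyond the coded column")


# Hypothetical column: entry s is the next state when the DFA is in state s.
column = [0, 2, 2, 4, 5, 0]
runs = run_length_encode(column)
print(runs)                        # [(1, 0), (2, 2), (1, 4), (1, 5), (1, 0)]
print(select_next_state(runs, 2))  # 2
```

Columns in which most states share a next state, which is typical for symbols that simply reset the match, compress to a single run.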
Also, according to yet another aspect of the present invention, a layer of indirection can be used to map ECIs to transitions in the transition table memory. This layer of indirection permits various optimization techniques that improve the run-length coding of the transition entries and that pack the run-length coded transition entries into words of the transition table memory such that the number of necessary accesses to transition table memory can be minimized. With the use of indirection, the indirection entries in the indirection table memory can be populated to configure the mappings of ECIs to transition entries in the transition table memory such that those mappings take into consideration any optimization processes that were performed on the transition entries in the transition table memory.
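A minimal Python sketch of such an indirection layer follows. It assumes that each indirection entry holds an address and an entry count, and that transitions are stored as (run length, next state) pairs; both formats, the sample data, and the function names are hypothetical.

```python
# Hypothetical run-length coded columns, one per ECI: (run_length, next_state).
RUNS_BY_ECI = [
    [(6, 0)],
    [(6, 1)],
    [(2, 0), (1, 3), (3, 0)],
    [(1, 0), (2, 2), (1, 4), (1, 5), (1, 0)],
]


def build_memories(runs_by_eci):
    """Lay the runs out in one flat memory and record where each ECI's runs live."""
    transition_memory = []
    indirection_table = []
    for runs in runs_by_eci:
        indirection_table.append((len(transition_memory), len(runs)))
        transition_memory.extend(runs)
    return transition_memory, indirection_table


def next_state(transition_memory, indirection_table, eci, state):
    """Resolve the ECI through the indirection table, then select by state."""
    address, count = indirection_table[eci]
    covered = 0
    for length, target in transition_memory[address:address + count]:
        covered += length
        if state < covered:
            return target
    raise ValueError("state not covered by the stored runs")


MEMORY, INDIRECTION = build_memories(RUNS_BY_ECI)
print(INDIRECTION)                             # [(0, 1), (1, 1), (2, 3), (5, 5)]
print(next_state(MEMORY, INDIRECTION, 3, 3))   # 4
```

Because only the indirection entries refer to addresses, the transition entries can be reordered or repacked freely as long as the indirection table is rewritten to match.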
Furthermore, according to another aspect of the present invention, disclosed herein is an optimization algorithm for ordering the DFA states in the transition table, thereby improving the DFA's memory requirements by increasing the efficiency of the run-length coded transitions.
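The effect of state ordering on run-length coding can be illustrated with the Python sketch below. It is not the optimization algorithm referred to above, which is not described in this passage; it is an exhaustive search, feasible only for very small DFAs, over a hypothetical six-state table with four symbol classes.

```python
from itertools import permutations

# Hypothetical per-class columns: COLUMNS[c][s] is the next state when the
# DFA is in state s and reads a symbol of class c.
COLUMNS = [
    [0, 0, 0, 0, 0, 0],  # any other symbol
    [1, 1, 1, 1, 1, 1],  # "$"
    [0, 0, 3, 0, 0, 0],  # "."
    [0, 2, 2, 4, 5, 0],  # decimal digit
]


def count_runs(order):
    """Total number of runs over all columns when the rows follow `order`.

    Renumbering the states would also rename the entries, but equal entries
    stay equal, so the run count depends only on the row order.
    """
    total = 0
    for column in COLUMNS:
        reordered = [column[state] for state in order]
        total += 1 + sum(a != b for a, b in zip(reordered, reordered[1:]))
    return total


def best_order():
    """Try every row order and keep one with the fewest runs."""
    return min(permutations(range(6)), key=count_runs)


print(count_runs(range(6)))      # 10 runs in the original order
print(count_runs(best_order()))  # 8 runs after reordering
```

Placing states with the same next state next to each other is what shortens the coded columns; a practical algorithm would need a heuristic rather than a factorial-time search.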
Further still, disclosed herein is an optimization algorithm for efficiently packing the transition table entries into memory words such that the number of transition table entries sharing a common corresponding input symbol (or derivative thereof such as ECI) that span multiple memory words is minimized. This memory packing process operates to improve the DFA's throughput because the efficient packing of memory can reduce the number of memory accesses that are needed when processing one or more input symbols.
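The packing goal can be illustrated in Python with a first-fit-decreasing heuristic, a standard bin-packing technique substituted here for illustration; it is not the packing algorithm of the present disclosure. The entry counts, the word capacity, and the function names are assumptions.

```python
def sequential_spans(sizes, capacity):
    """Count the groups that straddle a word boundary when packed in order."""
    spans, offset = 0, 0
    for size in sizes:
        first_word = offset // capacity
        last_word = (offset + size - 1) // capacity
        if last_word != first_word:
            spans += 1
        offset += size
    return spans


def first_fit_decreasing(sizes, capacity):
    """Place each group wholly inside one word; return {eci: (word, offset)}."""
    free = []       # remaining entry slots in each memory word
    placement = {}
    for eci in sorted(range(len(sizes)), key=lambda e: -sizes[e]):
        size = sizes[eci]
        if size > capacity:
            raise ValueError("group is larger than one memory word")
        for word, room in enumerate(free):
            if size <= room:
                break
        else:
            free.append(capacity)   # open a new memory word
            word = len(free) - 1
        placement[eci] = (word, capacity - free[word])
        free[word] -= size
    return placement


# Hypothetical example: four ECIs with 3, 3, 2 and 2 coded entries each,
# and memory words that hold 5 entries.
sizes = [3, 3, 2, 2]
print(sequential_spans(sizes, 5))      # 1 group spans two words
print(first_fit_decreasing(sizes, 5))  # {0: (0, 0), 1: (1, 0), 2: (0, 3), 3: (1, 3)}
```

With the heuristic placement, every ECI's entries sit in a single word, so each input symbol needs one memory access in this example.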
According to another aspect of the present invention, the patterns applied during a search can be changed dynamically without altering the logic of the pipeline architecture itself. A regular expression compiler need only populate the transition table memory, indirection table, ECI mapping tables, and related registers to reprogram the pattern matching pipeline to a new regular expression.
Based on the improvements to DFA design presented herein, the inventors herein believe that the throughput and density achieved by the preferred embodiment of the present invention greatly exceed other known pattern matching solutions.
These and other inventive features of the present invention are described hereinafter and will be apparent to those having ordinary skill in the art upon a review of the following specification and figures.