The communications industry is rapidly changing to adjust to emerging technologies and ever increasing customer demand. This customer demand for new applications and increased performance of existing applications is driving communications network and system providers to employ networks and systems having greater speed and capacity (e.g., greater bandwidth). In trying to achieve these goals, a common approach taken by many communications providers is to use packet switching technology. Increasingly, public and private communications networks are being built and expanded using various packet technologies, such as Internet Protocol (IP).
Regular expression matching is becoming a common operation to be performed at high speeds. For example, URLs may need to be located in Layer 7 (L7) packet headers only if they match a set of regular expressions to classify the sessions appropriately. Similarly, regular expression matching is used for intrusion detection, security screening (e.g., whether an email or other message contains certain patterns of keywords), load balancing of traffic across multiple servers, and array of many other applications.
A problem, especially for high speed applications, is the rate at which matching can be performed, as well as the space required to store the match identification data structure. A common method to match common expressions is to convert them to a deterministic finite automaton (DFA). The use of DFAs for regular expression matching which produces a set of matched regular expressions upon reaching a final state is well-known. From one perspective, a DFA is a state machine which processes each character of an input string, and upon reaching a final state, generates a list of one or matched regular expressions. The memory requirements and speed at which these DFAs may be traversed may not meet the needs of certain applications, especially some high-speed applications.
For example, if multiple regular expressions are to be simultaneously matched against, then the DFAs for the different regular expressions typically are multiplied to get a single DFA for the entire collection. However, multiplying DFAs together can generate an exponential number of states, thus making it impractical for certain applications. Individual DFAs could be simultaneously checked, however such an approach requires that the state for each DFA be updated for each character processed. For each character in the string this could mean a large number of memory accesses, one for each DFA. Alternatively, the DFAs could be multiplied together to form a combined DFA.
Traditional literature discusses nondeterministic finite automata (NFAs) and DFAs with the intent of producing a single DFA. Indeed, most approaches to the problem have involved compiling separate and disjunctive sets of regular expressions into DFAs. Here there tend to be two extremes. First, for a purely table driven approach, the largest DFAs are constructed and run in parallel. Second, a recent hardware accelerated approach is to create many smaller DFAs and run those in parallel. What these approaches share in common is that they perform the partitioning at the regular expression level. A DFA represents sets of whole and entire regular expressions. This produces the deterministic property, but also adds greatly to the resources necessary to implement such a partitioning of the problem space, either an excessive table footprint, or many processors running in parallel.