The Need for Pattern Matching and Searching in Networks
Computer networks, such as local area networks, wide area networks, and internetworks such as the Internet, are common. As organizations adopt and increase network usage, the number of applications and the demands placed upon network functionality increase. Accordingly, much effort in the art has been devoted to producing more efficient ways to process the ever-increasing amount of information transmitted over computer networks.
Data that is transmitted over a network is typically broken up into one or more discrete packets of data. Nodes of a computer network, e.g., routers, switches, firewalls, aggregators, and so forth, perform complex data analysis and management on these packets. It is these network nodes that typically have increasing demands placed upon them, and that require new inventive methods to keep up with the ever-increasing flow of network traffic.
One operation required of such nodes is searching the packets for specific patterns.
Network data analysis often involves scanning, processing or mining network data for useful information. For the simplest analysis problems, the standards according to which networks are designed provide a useful description, e.g., a map, for some information contained in network traffic. For example, many methods have been developed to scan network traffic for information by looking up packet headers, e.g., MAC addresses, IP addresses, TCP port numbers, and so forth. As a result, a number of early network analysis methods and products based thereon have been developed using simple, fixed-length pattern scanning techniques.
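As an illustration of the fixed-position header lookups described above, the following is a minimal sketch, assuming an untagged Ethernet frame carrying IPv4 and TCP. The function name parse_headers and the sample frame are hypothetical and are introduced only for this example.

```python
import socket
import struct

ETH_HEADER_LEN = 14        # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def parse_headers(frame: bytes) -> dict:
    """Read a few fields that sit at standard-defined offsets in the frame."""
    dst_mac, src_mac, ethertype = struct.unpack_from("!6s6sH", frame, 0)
    fields = {
        "dst_mac": dst_mac.hex(":"),
        "src_mac": src_mac.hex(":"),
        "ethertype": ethertype,
    }
    if ethertype != ETHERTYPE_IPV4:
        return fields
    ip = ETH_HEADER_LEN
    ihl = (frame[ip] & 0x0F) * 4                 # IPv4 header length in bytes
    fields["protocol"] = frame[ip + 9]
    fields["src_ip"] = socket.inet_ntoa(frame[ip + 12:ip + 16])
    fields["dst_ip"] = socket.inet_ntoa(frame[ip + 16:ip + 20])
    if fields["protocol"] == IPPROTO_TCP:
        # The TCP ports are the first two 16-bit fields after the IPv4 header.
        src_port, dst_port = struct.unpack_from("!HH", frame, ip + ihl)
        fields["src_port"] = src_port
        fields["dst_port"] = dst_port
    return fields


# Hypothetical sample frame, built by hand for illustration only.
SAMPLE_FRAME = (
    bytes.fromhex("aabbccddeeff") + bytes.fromhex("112233445566")
    + struct.pack("!H", ETHERTYPE_IPV4)
    + bytes([0x45, 0]) + struct.pack("!H", 40) + bytes(4)
    + bytes([64, IPPROTO_TCP]) + bytes(2)
    + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2")
    + struct.pack("!HH", 12345, 80) + bytes(16)
)

print(parse_headers(SAMPLE_FRAME))
```

Every field here is found by offset alone, which is why such lookups are simple; nothing in this sketch can locate a pattern whose position in the payload is not known in advance.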
More complex network analysis problems require processing a plurality of packets that are related to each other. Such analysis is typically required when one wishes to analyze traffic at a higher level. For example, higher level network traffic analysis is required to perform packet classification, protocol analysis, URL address lookup, payload content inspection, virus signature scanning, intrusion detection, stateful firewall rule matching, load balancer rule matching, and so forth. Further, such higher-level analysis sometimes requires more complicated searching functionality than simple field lookups according to a network standard.
Pattern Matching and Searching Methods
Searching an input, e.g., a network data stream, for a simple sequence of inputs is well known in the art. The inputs in general are called “characters” of an “alphabet.” For example, an input may be an octet of the alphabet of all possible octets. Alphabetical characters are used herein to describe inputs only for the sake of explanation, and also because searching for text patterns is common. For example, a simple sequential character-searching problem might be trying to find if and where the sequence “cab” appears in the input “abacab.” Some methods of the art for solving such a problem include the Knuth-Morris-Pratt method (D. E. Knuth, J. H. Morris Jr., V. R. Pratt: Fast pattern matching in strings, SIAM Journal on Computing, vol. 6, no. 2, pp. 323-350, 1977), the Aho-Corasick method (A. V. Aho and M. J. Corasick: Efficient string matching: An aid to bibliographic search, Commun. ACM, vol. 18, no. 6, pp. 333-340, 1975), the Rabin-Karp method (Richard Karp and Michael Rabin: Efficient randomized pattern-matching algorithms, IBM J. Res. Develop., vol. 31, no. 2, pp. 249-260, 1987), and the Boyer-Moore method (R. S. Boyer, J. S. Moore: A fast string searching algorithm, Commun. ACM, vol. 20, no. 10, pp. 762-772, 1977).
Searching an input for a more complicated pattern of inputs, i.e., pattern matching, is also well known in the art. For example, one may wish to find all string sequences starting with “a” followed by at least one “b” in the input “cbaabbcaba.” Such a problem can be more generally described as finding certain input sequences, i.e., a pattern, in another sequence of inputs.
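The “cab” in “abacab” search can be sketched with the Knuth-Morris-Pratt approach as follows. This is a minimal illustration and not a reproduction of the cited paper; the function names build_failure_table and kmp_find_all are chosen here for the example.

```python
def build_failure_table(pattern: str) -> list[int]:
    """fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]      # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]      # keep going to find overlapping matches
    return hits


print(kmp_find_all("abacab", "cab"))  # [3]
```

Each input character is read once, and the failure table decides how much of a partial match survives a mismatch, so the text is never rescanned.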
Regular Expressions
Pattern matching typically involves the use of a pattern description language, e.g., a grammar, to describe the pattern and then a method to transform an expression in the language to a state machine that will run with an input that may contain the pattern. For example, one popular pattern description language is known as the language of Regular Expressions. See, for example, Jeffrey E. F. Friedl, Andy Oram, Mastering Regular Expressions, 2nd Edition, O'Reilly, 2002. See also the online guide called “Regular Expressions”, part of The Single UNIX® Specification, Version 2, from The Open Group, available at http://www.opengroup.org/onlinepubs/007908799/xbd/re.html (valid on Oct. 17, 2004).
While those skilled in the art would be familiar with regular expressions, a brief introduction is provided herewith for completeness. Regular expressions are useful because they can express complicated patterns using only a relatively small number of characters. For example, all string sequences starting with “a” followed by at least one “b” can be represented as “ab+.” See Table 1 for a short list of some regular expression rules and Table 2 for some examples of regular expressions.
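As a brief illustration, the “ab+” expression can be tried against the earlier input “cbaabbcaba” using Python's standard re module. Python's regular expression dialect is an assumption made for this sketch; it agrees with the rules discussed here for these simple expressions.

```python
import re

text = "cbaabbcaba"

# Find every "a" followed by at least one "b", with its start position.
matches = [(m.start(), m.group()) for m in re.finditer(r"ab+", text)]
print(matches)  # [(3, 'abb'), (7, 'ab')]


def matches_whole(pattern: str, candidate: str) -> bool:
    """True if the entire candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None


# A few whole-string checks in the same spirit.
print(matches_whole(r".*abc", "89abc"), matches_whole(r".*abc", "89abd"))
print(matches_whole(r"a(b|c)", "ac"), matches_whole(r"a(b|c)", "ax"))
print(matches_whole(r"A*B+", "AABB"), matches_whole(r"A*B+", "AA"))
```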
TABLE 1
Some Regular Expression Rules

Expression   Meaning
.            Match any character
*            Match the preceding occurring at least one time, or no times
+            Match the preceding 1 or more times
A|B          Match A or B
(A)          Match just A (often used in conjunction with |)
[list]       Match a character, only if it is in list
[^list]      Match a character, only if it is not in list
TABLE 2
Some Regular Expression Examples

Expression   Behavior
.*abc        Matches “89abc”, “abc” but not “89abd”
a(b|c)       Matches “ab”, “ac”, but not “ax”
A*B+         Matches “AABB”, “AB”, but not “AA”
b[0-5]c      Matches “b0c”, “b4c”, but not “b9c” or “bc”

State Machines and Finite Automata for Regular Expressions
It is well known to those in the art to use state machines to search for patterns, e.g., patterns expressed by regular expressions. Those in the art also know that finite automata theory can be used to abstractly describe state machines that search for patterns expressed by regular expressions. According to finite automata theory, a state machine is a limited abstract computing device that finds in an input the pattern(s) described by the state machine description. A state machine includes a finite set of states and an alphabet, e.g., a set of one or more inputs. There is one start state, e.g., state zero, and at least one final state defined as a state that produces a “hit,” called a hit state herein, and meaning that a pattern was found. Each state has one or more paths that point to any state of the set of states, e.g., another state or the same state. Each path is associated with an input from the alphabet of inputs.
When processing an input, e.g., a sequence of at least one input from the alphabet of possible inputs, a state machine starts in the start state. As the state machine reads each input, it moves to the next state defined by the path associated with the input. By some conventions, if there is no path associated with the input, the state machine moves to a special error state that indicates, e.g., that the pattern represented by the state machine does not exist in the sequence of inputs. By other conventions, every state must have a path associated with every possible input of the alphabet of inputs. If the state machine moves to a hit state, then this indicates a match, e.g., the pattern represented by the state machine description exists in the input. If the state machine reaches the end of the sequence of inputs without moving to a hit state, then this indicates no match, e.g., the pattern represented by the state machine description does not exist in the sequence of inputs.
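The procedure just described can be sketched as a table-driven state machine. The three-state table below is a hand-built assumption for this example: it searches for an “a” followed by at least one “b” over the alphabet {a, b, c}, with state 0 as the start state and state 2 as the only hit state.

```python
START = 0
HIT_STATES = {2}

# One path per (state, input) pair; each path names the next state.
TRANSITIONS = {
    0: {"a": 1, "b": 0, "c": 0},   # nothing useful seen yet
    1: {"a": 1, "b": 2, "c": 0},   # an "a" has just been seen
    2: {"a": 1, "b": 2, "c": 0},   # "a" followed by at least one "b"
}


def run_state_machine(inputs):
    """Return the position of the input that first moves the machine to a
    hit state, or None if no hit state is reached."""
    state = START
    for pos, symbol in enumerate(inputs):
        next_state = TRANSITIONS[state].get(symbol)
        if next_state is None:
            return None            # no path for this input: error-state convention
        state = next_state
        if state in HIT_STATES:
            return pos
    return None                    # end of input without a hit: no match


print(run_state_machine("cbaabbcaba"))  # 4
print(run_state_machine("cacaca"))      # None
```

The loop does one table lookup per input, so the work grows linearly with the length of the input.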
Deterministic Finite Automata (DFAs) and Non-Deterministic Finite Automata (NFAs)
It is known in finite automata theory to classify a state machine into one of two general classifications: deterministic finite automata (DFAs) and non-deterministic finite automata (NFAs). DFAs are state machines wherein each path of a given state is uniquely associated with an input of the alphabet. In other words, for a particular state, there cannot be two paths associated with the same input. Thus, for DFAs, the computational time complexity for processing an input has an upper bound, e.g., the computational time complexity is O(n) where n is the length of the sequence of inputs. Although generating a DFA from a regular expression, e.g., compiling a DFA from a regular expression, requires additional computation, using the resulting DFA state machine with any input is in general very efficient.
NFAs are state machines that differ from DFAs in several ways. First, NFAs have no uniqueness restriction applied to paths, i.e., for any particular state, two or more paths may be associated with the same input from the alphabet of inputs. Furthermore, NFAs allow what are called “null” paths or transitions. Such paths are represented as being associated with a special character ε. A null path represents a spontaneous transition that does not consume an input character. Additionally, when an NFA is used to process an input, a particular state that has null and non-null paths can transition via any one of the null paths, or via any one of the non-null paths associated with the current input. That is, when in a first state and processing an input for which there are a plurality of paths, if a particular chosen path does not lead to a hit state, the NFA returns to that first state, i.e., the previous state, and tries another possible path associated with that input. Thus, for NFAs, the computational time complexity for processing an input can be exponential, e.g., the computational time complexity can be O(c^n), where n is the number of inputs and c is an integer greater than 1. Typically, generating an NFA description for a pattern yields a state machine with a smaller set of states than generating a DFA description from the same pattern. However, using that NFA as a state machine for a provided input leads to unpredictable computational time complexity.
Finite automata theory is well known in the art, and as a result, a number of methods have been developed and are known to transform an NFA into a DFA and a DFA into an NFA. Methods also are known for transforming a pattern such as a regular expression into an NFA for matching the pattern, and into a DFA for matching the pattern.
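The try-a-path-then-return behavior described above can be sketched as a recursive search. The small NFA below is a hand-built assumption for this example: it accepts strings over {a, b} that end in “ab”, has two paths for “a” out of state 0, and has one null path. The sketch also assumes the NFA has no cycle made only of null paths, since such a cycle would make this simple recursion loop forever.

```python
EPSILON = ""   # key used here for null paths

# state -> input -> list of next states
NFA = {
    0: {"a": [0, 1], "b": [0]},   # two paths share the input "a"
    1: {"b": [2]},
    2: {EPSILON: [3]},            # null path: taken without consuming input
    3: {},
}
NFA_START = 0
NFA_HITS = {3}


def nfa_accepts(inputs, state=NFA_START, pos=0):
    """Try each available path in turn; when one fails, return to this
    state and try the next one."""
    if pos == len(inputs) and state in NFA_HITS:
        return True
    paths = NFA[state]
    for nxt in paths.get(EPSILON, []):
        if nfa_accepts(inputs, nxt, pos):          # input position unchanged
            return True
    if pos < len(inputs):
        for nxt in paths.get(inputs[pos], []):
            if nfa_accepts(inputs, nxt, pos + 1):  # one input consumed
                return True
    return False


print(nfa_accepts("abab"))  # True
print(nfa_accepts("aba"))   # False
```

How many paths are explored depends on both the NFA and the particular input, which is the source of the unpredictable running time noted above.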
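One commonly taught transformation from an NFA to a DFA is the subset construction, in which each DFA state stands for a set of NFA states. The following is a minimal sketch of that textbook technique, offered as an illustration and not as any particular method of the art; the example NFA, which accepts strings over {a, b} ending in “ab”, and all names are assumptions made for this example.

```python
from collections import deque

EPSILON = ""   # key used here for null paths

# Example NFA: state -> input -> list of next states
NFA = {
    0: {"a": [0, 1], "b": [0]},
    1: {"b": [2]},
    2: {EPSILON: [3]},
    3: {},
}


def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using only null paths."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for nxt in nfa.get(s, {}).get(EPSILON, []):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, hit_states, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = epsilon_closure(nfa, {start})
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            moved = {n for s in current for n in nfa.get(s, {}).get(symbol, [])}
            target = epsilon_closure(nfa, moved)
            dfa[current][symbol] = target      # exactly one path per input
            if target not in dfa:
                queue.append(target)
    dfa_hits = {s for s in dfa if s & hit_states}
    return dfa, start_set, dfa_hits


def dfa_accepts(dfa, start, hits, inputs):
    """Run the DFA; inputs must use only symbols from the alphabet."""
    state = start
    for symbol in inputs:
        state = dfa[state][symbol]
    return state in hits


DFA, DFA_START, DFA_HITS = nfa_to_dfa(NFA, 0, {3}, "ab")
print(len(DFA))                                        # 3
print(dfa_accepts(DFA, DFA_START, DFA_HITS, "abab"))   # True
```

The resulting DFA needs no backtracking, but in the worst case the number of state sets can grow exponentially with the number of NFA states.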
Overview of Some Needs in the Art
As more and more networks are used, as the speed of such networks becomes higher and higher, and as networks become more and more complex, there is a general need in the art for processing network traffic more efficiently. When searching for patterns, e.g., patterns described by a pattern language such as a regular expression, there is a need in the art for managing computer implemented state machines in an efficient manner usable for such searching.
For example, as regular expressions become more complex, the number of states in a DFA or NFA can become rather large. There is a need in the art for methods to efficiently represent the states of a finite state machine that can be used for searching for a pattern describable by a regular expression. For example, there is a need in the art to represent such states and state transition information of a state machine in a compressed manner that does not require much memory. There also is a need in the art to represent such state and state transition information in a manner that provides for rapidly reading the information from memory or storage so that searching can be carried out rapidly, e.g., in real time for relatively fast networks.
While it is known how to design a state machine that can search for (“match”) patterns describable by a single regular expression, there is a need in the art for a state machine and for a search method that can simultaneously search for a plurality of regular expressions.
It is desirable to be able to manage both patterns that are known a priori, e.g., patterns for well known protocols, for known packet formats, for known viruses, for known intrusion signatures, and so forth, and also patterns still to be discovered, e.g., patterns for new protocol and packet formats, for firewall security rules, for quality of service (“QoS”) policies, for newly discovered viruses or intrusion signatures, and so forth. There is thus a need in the art to efficiently manage, e.g., iteratively manage, patterns as they are developed in a pattern database in a scalable manner.
Thus, there is a need in the art for a method and apparatus to efficiently generate, e.g., compile, a set of at least one pattern into a set of at least one state machine that can be processed in O(n) computational time complexity for an input of size n. Further, there is a need in the art to efficiently transform a set of at least one state machine into a pattern database usable for processing input to generate an output of possible pattern matches.