Packet classification is employed by Internet routers to implement a number of advanced Internet services such as policy-based routing, rate-limiting, access control in firewalls, service differentiation, traffic shaping, and traffic billing. Each of these services requires the router to classify incoming packets into different classes and then to perform appropriate actions depending upon the packet's specified class. For example, in packet routing applications, an incoming packet is classified to determine whether to forward or filter the packet, where to forward the packet, what class of service the packet should receive, and/or how much should be charged for transmitting the packet. A packet classifier embodies a set of policies or rules that define what actions are to be taken based upon the contents of one or more fields of the packet's header. The packet header, which typically includes source and destination addresses, source and destination port numbers, protocol information, and so on, can match more than one rule. For example, one rule in a firewall application can specify either a “permit” or “deny” action for a given set of source and destination addresses, another rule in the firewall application can specify either a “permit” or “deny” action for a given protocol, and yet another rule in the firewall application can specify either a “permit” or “deny” action for a particular source address and protocol.
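The observation that a single packet header can match more than one rule can be illustrated with a minimal sketch. Everything named here (the Rule class, the field names, the addresses, and the three example rules) is hypothetical and chosen only to mirror the three kinds of firewall rule described above; the sketch also assumes exact-value field comparison, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical firewall rule; a field left as None matches any value."""
    name: str
    action: str                     # "permit" or "deny"
    src: Optional[str] = None
    dst: Optional[str] = None
    protocol: Optional[str] = None

    def matches(self, header: dict) -> bool:
        # A rule matches when every field it specifies equals the header's value.
        wanted = (("src", self.src), ("dst", self.dst), ("protocol", self.protocol))
        return all(value is None or header.get(field) == value
                   for field, value in wanted)


# One rule per kind described in the text: address pair, protocol, source plus protocol.
RULES = [
    Rule("addr-pair", "permit", src="10.0.0.1", dst="10.0.0.2"),
    Rule("proto", "deny", protocol="udp"),
    Rule("src-proto", "permit", src="10.0.0.1", protocol="tcp"),
]

header = {"src": "10.0.0.1", "dst": "10.0.0.2",
          "src_port": 1234, "dst_port": 80, "protocol": "tcp"}

# Collect every rule the header satisfies; more than one may apply.
matched = [rule.name for rule in RULES if rule.matches(header)]
print(matched)
```

Here the same header satisfies both the address-pair rule and the source-plus-protocol rule, so a classifier needs some way to decide which matching rule governs the action.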
In addition, security is increasingly becoming a critical issue in enterprise and service-provider networks as usage of public networks such as the Internet increases. Many organizations continue to rely upon firewalls as their central gatekeepers to prevent unauthorized users from entering their networks. However, organizations are increasingly looking to additional security measures to counter risk and vulnerability that firewalls alone cannot address. Intrusion Detection Systems (IDSs) analyze data in real time to detect, log, and stop misuse or attacks as they occur. More specifically, network-based IDSs analyze packet data streams within a network searching for unauthorized activity, such as attacks by hackers. In many cases, IDSs can respond to security breaches before systems are compromised. When unauthorized activity is detected, the IDS typically sends alarms to a management console with details of the activity and can often order other systems, such as routers, to cut off the unauthorized sessions.
The parsing of data packets for an IDS is typically performed in software by a dedicated module or library executed by a processor. These modules or libraries may be written in any number of well-known programming languages. The Perl programming language is often selected because of its highly developed pattern matching capabilities. In Perl, the patterns that are being searched for are generally referred to as regular expressions. A regular expression can simply be a word, a phrase, or a string of characters. More complex regular expressions include metacharacters that provide certain rules for performing the match. The period (“.”), which is similar to a wildcard, is a common metacharacter. It matches exactly one character, regardless of what the character is (by default, any character other than a newline). Another metacharacter is the plus sign (“+”), which indicates that the character immediately to its left may be repeated one or more times. If the data being searched conforms to the rules of a particular regular expression, then the regular expression is said to match that data. For example, the regular expression “gauss” would match data containing gauss, gaussian, degauss, etc.
To search an input string for a regular expression, the regular expression is typically converted into a search structure such as a non-deterministic finite automaton (NFA) that includes a number of states interconnected by goto or “success” transitions. A state machine then starts at an initial state of the NFA and moves between states of the NFA according to its goto transitions. If in a given state the input character matches a goto transition, that goto transition is taken to the next state in the string path and a cursor is incremented to point to the next character in the input string. If the input character does not match any goto transition at a given state, which results in an edge failure, a failure transition may be taken.
For example, FIG. 1 shows an NFA 100 that embodies the regular expression R1=“ab([a-f][a-d])+def”. The NFA 100 includes an initial state or root node S0, intermediate states S1-S6, and a match state S7. The states S0-S7 are connected by goto transitions representing character matches with an input string, as indicated in FIG. 1. For example, if an input character “a” is received at the initial state S0, the state machine transitions to S1 along the “a” goto transition, and the cursor is incremented to the next input character in the input string. If the next input character is a “b,” the state machine transitions from S1 to S2 along the “b” goto transition. Then, if the next input character belongs to the character class [a-f], the state machine transitions from S2 to S3 along the “[a-f]” goto transition. Conversely, if at S1 the next input character is anything other than a “b,” or if at S2 the next input character is not ‘a,’ ‘b,’ ‘c,’ ‘d,’ ‘e,’ or ‘f,’ then the state machine returns to the initial state S0, which typically remains active for the NFA, as indicated by the arrow 101.
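The metacharacter behavior described above can be illustrated with a short sketch. Python's re module is used here only as a convenient stand-in, which is an assumption of this example; for these particular constructs its syntax agrees with Perl's. The patterns “ga.ss” and “gaus+” and the sample strings other than gauss, gaussian, and degauss are hypothetical.

```python
import re

# A plain word matches wherever it appears inside the searched data.
words = ["gauss", "gaussian", "degauss", "gaus"]
plain_hits = [w for w in words if re.search(r"gauss", w)]

# "." stands for exactly one character.
dot_hit = re.search(r"ga.ss", "gauss") is not None    # "." consumes the "u"
dot_miss = re.search(r"ga.ss", "gass") is not None    # nothing left for "." to consume

# "+" repeats the item to its left one or more times.
plus_hit = re.fullmatch(r"gaus+", "gausss") is not None   # "s" repeated three times
plus_miss = re.fullmatch(r"gaus+", "gau") is not None     # "+" needs at least one "s"

print(plain_hits, dot_hit, dot_miss, plus_hit, plus_miss)
```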
The regular expression R1=“ab([a-f][a-d])+def” is a complex regular expression for which multiple portions of the corresponding NFA 100 can be active at the same time. For example, upon a character class match with [a-d] at state S3, the state machine transitions to both states S2 and S4, and therefore states S2 and S4 are simultaneously active. Table 1 depicts a search operation between an input string IN1=“abbababccdefg” and R1 according to the NFA 100 of FIG. 1.
TABLE 1

Cycle   Input Character   Active States   Action
1       a                 1
2       b                 2
3       b                 3
4       a                 1, 2, 4
5       b                 2, 3
6       a                 1, 2, 3, 4
7       b                 2, 3, 4
8       c                 2, 3, 4
9       c                 2, 3, 4
10      d                 2, 3, 4, 5
11      e                 3, 6
12      f                 7               match
13      g                 none
As shown in Table 1, in response to the first three input characters “abb,” the state machine transitions from the initial state S0 to state S3 along goto transitions “a,” “b,” and “[a-f],” respectively. The fourth input character ‘a’ of IN1 matches “[a-d]” and causes the state machine to transition from state S3 to both states S2 and S4 at the same time. Further, because ‘a’ is the first character of R1, the state machine also transitions from the initial state S0 to state S1 (not shown in FIG. 1 for simplicity). Thus, in the fourth cycle, states S1, S2, and S4 are simultaneously active in NFA 100. The next input character ‘b’ causes the state machine to simultaneously transition from S1 to S2 and from S2 to S3. The search operation proceeds as depicted in Table 1 until the match or accept state S7 is reached upon a match with the input character ‘f’ in the twelfth cycle, which indicates a match condition between the input string and the regular expression. Thereafter, the last input character ‘g’ does not match the goto transition from the initial state S0, and thus does not result in an additional match. Note that although not indicated in Table 1 for simplicity, the initial state S0 remains active during search operations so that a match sequence can be initiated at any point in the input string.
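The search operation of Table 1 can be reproduced with a small simulation, offered here as a sketch under stated assumptions rather than as the implementation of any particular search engine. The dictionary encoding of NFA 100 and the names chars, NFA_100, and trace are hypothetical; the transitions out of states S4, S5, and S6 are inferred from R1 and from Table 1, since they are not spelled out in the text.

```python
def chars(lo, hi):
    """Return the set of characters from lo to hi inclusive, e.g. [a-f]."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


# Hypothetical encoding of NFA 100: state -> list of (accepted chars, next states).
NFA_100 = {
    0: [({"a"}, {1})],
    1: [({"b"}, {2})],
    2: [(chars("a", "f"), {3})],
    3: [(chars("a", "d"), {2, 4})],   # loops back to S2 and also advances to S4
    4: [({"d"}, {5})],
    5: [({"e"}, {6})],
    6: [({"f"}, {7})],
    7: [],                            # match state S7 has no outgoing transitions
}
MATCH_STATE = 7


def trace(nfa, text):
    """Return one (input char, active states excluding S0, matched) row per cycle."""
    active = set()
    rows = []
    for ch in text:
        nxt = set()
        for state in active | {0}:            # S0 stays active in every cycle
            for accepted, targets in nfa[state]:
                if ch in accepted:
                    nxt |= targets
        active = nxt
        rows.append((ch, sorted(active), MATCH_STATE in active))
    return rows


rows = trace(NFA_100, "abbababccdefg")
for cycle, (ch, states, matched) in enumerate(rows, start=1):
    print(cycle, ch, states or "none", "match" if matched else "")
```

States that receive no matching input character simply drop out of the active set, which models the return to S0 on an edge failure.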
To determine the length of the portion of the input string that matches the regular expression, the number of state transitions taken between the initial and match states of the NFA during the string search operation is typically counted. However, for complex regular expressions such as R1 in which multiple states of the corresponding NFA can be active at the same time, the number of state transitions between the initial and match states of the NFA does not provide the lengths of any overlapping or nested substrings that also match the regular expression, which may in turn complicate identification of such substrings. Further, use of the NFA search operation to calculate the lengths of all matching substrings may require a separate counter for each potentially matching substring, which is impractical.
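The bookkeeping burden described above can be made concrete with a sketch that attaches a set of start offsets to every active state, which stands in for the separate counter per potentially matching substring. This is an illustration of the problem and not a proposed solution. The NFA encoding and the names chars, NFA_100, and match_lengths are hypothetical, and the transitions out of states S4, S5, and S6 are inferred from R1.

```python
def chars(lo, hi):
    """Return the set of characters from lo to hi inclusive, e.g. [a-f]."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


# Hypothetical encoding of the NFA for R1 = ab([a-f][a-d])+def.
NFA_100 = {
    0: [({"a"}, {1})],
    1: [({"b"}, {2})],
    2: [(chars("a", "f"), {3})],
    3: [(chars("a", "d"), {2, 4})],
    4: [({"d"}, {5})],
    5: [({"e"}, {6})],
    6: [({"f"}, {7})],
    7: [],
}


def match_lengths(nfa, text, match_state=7):
    """Map each end index to the lengths of all matching substrings ending there."""
    starts = {}    # active state -> start offsets of every path now in that state
    found = {}
    for pos, ch in enumerate(text):
        nxt = {}
        # S0 is always active, and a path leaving it begins at the current position.
        for state, offsets in list(starts.items()) + [(0, {pos})]:
            for accepted, targets in nfa[state]:
                if ch in accepted:
                    for target in targets:
                        nxt.setdefault(target, set()).update(offsets)
        starts = nxt
        if match_state in starts:
            found[pos] = sorted(pos - s + 1 for s in starts[match_state])
    return found


lengths = match_lengths(NFA_100, "abbababccdefg")
print(lengths)
```

For IN1, the single activation of the match state in the twelfth cycle corresponds to two nested matching substrings, “ababccdef” and “abccdef”, and the number of start offsets that must be carried grows with the number of paths that are simultaneously alive.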
Like reference numerals refer to corresponding parts throughout the drawing figures.