A network system attack (also referred to herein as an intrusion) is usually defined as an unauthorized or malicious use of a computer or computer network. In some cases, a network system attack may involve hundreds to thousands of unprotected network nodes in a coordinated attack, which is leveled against specific or random targets. These attacks may include break-in attempts such as, but not limited to, email viruses, corporate espionage, general destruction of data, and the hijacking of computers/servers to spread additional attacks. Even when a system cannot be directly broken into, denial of service attacks can be just as harmful to individuals and companies that stake their reputations on providing reliable services over the Internet. Because of the population's increasing usage of and reliance on network services, individuals and companies have become increasingly aware of the need to combat system attacks at every level of the network, from end hosts and network taps to edge and core routers.
Intrusion Detection Systems (or IDSs) are emerging as one of the most promising ways of providing protection to systems on a network. Intrusion detection systems automatically monitor network traffic in real-time, and can be used to alert network administrators to suspicious activity, keep logs to aid in forensics, and assist in the detection of new viruses and denial of service attacks. They can be found in end-user systems to monitor and protect against attacks from incoming traffic, or in network-tap devices that are inserted into key points of the network for diagnostic purposes. Intrusion detection systems may also be used in edge and core routers to protect the network infrastructure from distributed attacks.
Intrusion detection systems increase protection by identifying attacks with valid packet headers that pass through firewalls. Intrusion detection systems provide this capability by searching both packet headers and payloads (i.e., content) for known attack data sequences, referred to as “signatures,” and following prescribed actions in response to detecting a given signature. In general, the signatures and corresponding response actions supported by an intrusion detection system are referred to as a “rule-set database,” “IDS database” or simply “database.” Each rule in the database typically includes a specific set of information, such as the type of packet to search, a string of content to match (i.e., a signature), a location from which to start the search, and an associated action to take if all conditions of the rule are matched. Different databases may include different sets of information, and therefore, may be tailored to particular network systems or types of attack.
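By way of illustration only, a single rule of the kind described above might be represented as follows. The field names (`protocol`, `signature`, `offset`, `action`) and the helper `apply_rule` are hypothetical names chosen for this sketch and are not drawn from any particular rule-set database format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One illustrative rule; all field names are assumptions."""
    protocol: str     # type of packet to search, e.g. "tcp"
    signature: bytes  # string of content to match
    offset: int       # payload position from which to start the search
    action: str       # action to take if all conditions are matched


def apply_rule(rule: Rule, protocol: str, payload: bytes) -> Optional[str]:
    """Return the rule's action if every condition matches, else None."""
    if protocol != rule.protocol:
        return None
    # bytes.find returns -1 when the signature does not occur at or
    # after the starting offset.
    if payload.find(rule.signature, rule.offset) == -1:
        return None
    return rule.action
```

A practical intrusion detection system evaluates many such rules at once, which is why the string matching engine discussed next operates on the whole set of signatures rather than on one rule at a time.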
At the heart of most modern intrusion detection systems is a string matching engine that compares the data arriving at the system to one or more signatures (i.e., strings) in the rule-set database and flags data containing an offending signature. As data is generally searched in real time in ever-faster network devices and rule databases continue to grow at a tremendous rate, string matching engines require rapidly increasing memory capacity and processing power to keep pace. Consequently, to avoid the escalating costs associated with ever-increasing hardware demands, designers have endeavored to improve the efficiency of the string matching methodology itself. FIG. 1, for example, illustrates a progression of approaches that have resulted in increased string matching speed, reduced memory requirements or both. In one approach, signatures within a signature definition (i.e., list of signatures) are decomposed into elemental states to form a representation of the signature definition referred to as a non-optimized state graph 110. After generating the state graph, a string matching engine may progress from state to state within the graph according to the sequence of incoming values. For example, starting from root node ‘0’, if an ‘a’ is received, the string matching engine transitions to state ‘1’ and, if a ‘b’ is received, from state ‘1’ to state ‘2.’ State ‘2’ is referred to as an output state 112 (and shown in bold in FIG. 1 to indicate such) as a sequence of data values matching the signature “ab” has been detected. 
After reaching state ‘2’, the string matching engine may transition to states 3, 4 and 5 or to states 3, 6, 7 and 8 upon receiving data sequences “cde” or “cebc,” respectively, or may return to the root node 112 if, at any node, the data at the cursor position (i.e., the incoming data value to be evaluated, also referred to herein as the “cursor data”) does not match any of the transition data values, referred to herein as edges, associated with the current node. Thus, if at node ‘3’ the cursor data is neither a ‘d’ nor an ‘e’, the string matching engine returns to the root node 112 and evaluates the cursor data against the edges (i.e., ‘a’ and ‘b’) at that node. If the cursor data does not match an edge of the root node, the string matching engine remains at the root node and advances the cursor.
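The traversal described above may be sketched as follows. This is a minimal illustration, not a reproduction of FIG. 1: the signature list used with it is an assumption, the state numbers it assigns need not match those in the figure, and the function names `build_trie` and `naive_search` are hypothetical. In the sketch, each pass of the outer loop corresponds to a return to the root node, with the cursor restarted one position after the point at which the previous attempt began.

```python
def build_trie(signatures):
    """Decompose signatures into states; state 0 is the root node."""
    goto = [{}]    # goto[state] maps an edge value to the next state
    outputs = {}   # output state -> signature detected at that state
    for sig in signatures:
        state = 0
        for ch in sig:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        outputs[state] = sig
    return goto, outputs


def naive_search(text, goto, outputs):
    """Return (matches, steps); steps counts cursor-data evaluations."""
    matches = []
    steps = 0
    for start in range(len(text)):       # each pass restarts at the root
        state = 0
        for cursor in range(start, len(text)):
            steps += 1
            state = goto[state].get(text[cursor])
            if state is None:            # edge failure: abandon this path
                break
            if state in outputs:
                matches.append((start, outputs[state]))
    return matches, steps
```

For the assumed signatures "ab", "abcde", "abcebc" and "bcd", searching the text "xabcebcd" reports all three occurrences but evaluates more data values than the text contains, because values consumed on a failed path are evaluated again after each restart.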
String matching using the non-optimized state graph 110 results in a number of inefficiencies, both in state graph storage requirements and in state graph processing. With regard to storage, each signature definition requires a worst-case storage of N*W nodes (where N is the number of signatures and W is the average signature length, the worst case occurring when there are no shared sub-strings between signatures), and the storage requirement thus grows with the product of the number and length of signatures, both of which have trended upward. From the standpoint of state graph processing, returning to the root node for each non-matching cursor data value requires rewinding the cursor by an amount that depends on the depth reached within the state graph, and thus substantial reprocessing of data.
Still referring to FIG. 1, a string-matching approach with substantially reduced data reprocessing may be achieved using the Aho-Corasick state graph 120. In the Aho-Corasick scheme, instead of returning to the root in response to edge failure (i.e., when the cursor data does not match any of the transition data values), the string matching engine may transition to a state that constitutes an accumulated substring within the path in which edge failure occurs. For example, if edge failure occurs at state ‘3’ (i.e., cursor data is not a ‘d’ or ‘e’), the string matching engine, having traversed the path “abc” and thus detected substring “bc,” may transition directly to state ‘10’, which corresponds to detection of “bc,” without having to return to the root node and rewind the cursor. Thus, transition to a non-root node in response to edge failure, a transition referred to herein as a “failover,” may save substantial data reprocessing and thus reduce processing load on the string matching engine. Unfortunately, the gain in processing efficiency incurs a memory consumption penalty as the Aho-Corasick state graph requires storage of additional failover pointers (i.e., pointers to the failover destinations) in addition to the worst-case W*N storage associated with the non-optimized state graph.
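The failover operation may be illustrated with a minimal sketch of the standard Aho-Corasick construction. The signature list used with it is an assumption rather than the content of FIG. 1, state numbers are assigned in order of creation and need not match the figure, and the names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(signatures):
    """Build goto edges, failover pointers and output lists per state."""
    goto, fail, out = [{}], [0], [[]]
    for sig in signatures:
        state = 0
        for ch in sig:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(sig)
    # Breadth-first pass: children of the root keep a failover pointer
    # to the root; deeper states fail over to the state representing the
    # longest proper suffix of their path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Signatures ending at the failover state also end here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, goto, fail, out):
    """Scan text once, never rewinding the cursor."""
    matches = []
    state = 0
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # failover instead of restart
        state = goto[state].get(ch, 0)
        for sig in out[state]:
            matches.append((pos - len(sig) + 1, sig))
    return matches
```

With the assumed signatures "ab", "abcde", "abcebc" and "bcd", the state reached by "abc" fails over to the state reached by "bc", so each data value of the input is consumed exactly once. The `fail` list is the additional failover-pointer storage noted above.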
In the last string matching approach shown in FIG. 1, path compression is applied to the Aho-Corasick state graph to obtain a path-compressed state graph 130 having a reduced number of nodes, and therefore a reduced memory storage requirement. Path compression involves concatenating each linear (i.e., non-branching) sequence of state transitions into a single state transition, with the data values that formerly formed the edges in the sequence of states concatenated into a string that forms the edge in the unified transition. The result of this operation is to reduce the number of nodes from the worst-case W*N of the traditional Aho-Corasick scheme shown at 120 to a worst-case 2N nodes (i.e., each new signature requires addition of at most two nodes, as when an existing path-compressed node is changed into a branch node plus two path-compressed nodes), thus reducing the number of nodes by a factor of W/2. Unfortunately, the reduction in the number of nodes does not translate to a proportional reduction in required memory, as the number of pointers required generally remains at N*W plus a number of pointers assigned to enable failover operation. That is, because an edge failure in a path-compressed node may result from failure at each of the individual data values, a separate failure pointer (i.e., a pointer to a failover node or to the root node) is typically provided for each data value of the compressed string to enable the traditional Aho-Corasick failover operation. This is illustrated in FIG. 1 by the three separate failure pointers to the root 132 for the constituent data values ‘b’, ‘c’ and ‘f’ of the compressed string (note that while the failure pointers all point to the root in the example shown, if data values ‘b,’ ‘c’ or ‘f,’ or data sequences ‘bc,’ ‘cf’ or ‘bcf’ appear in other extensions from the root node, the failure pointer for the corresponding data value may point to the appropriate state within the other extension to effect the Aho-Corasick failover operation). Thus, while substantial memory savings are achieved using the path-compressed Aho-Corasick approach, memory consumption still grows with the product of signature quantity and length in each new generation of rule databases.
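The effect of path compression on node count may be illustrated with the following sketch, which counts nodes before and after non-branching states are merged. It is an approximation under stated assumptions: the signature list is illustrative, the function name `count_nodes` is hypothetical, the root is included in both counts, and a state is treated as mergeable when it has exactly one outgoing edge and is not an output state. The sketch counts nodes only and does not model the per-data-value failure pointers discussed above.

```python
def count_nodes(signatures):
    """Return (uncompressed, compressed) node counts, root included."""
    children = [{}]       # children[state] maps an edge value to a state
    is_output = [False]   # True where a signature ends
    for sig in signatures:
        state = 0
        for ch in sig:
            if ch not in children[state]:
                children.append({})
                is_output.append(False)
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        is_output[state] = True
    total = len(children)
    # A state survives compression if it is the root, an output state,
    # or anything other than a single-edge pass-through state.
    kept = sum(
        1
        for s in range(total)
        if s == 0 or is_output[s] or len(children[s]) != 1
    )
    return total, kept
```

For the assumed signatures "ab", "abcde", "abcebc" and "bcd" (N = 4), the sketch reports 12 nodes before compression and 6 after, within the 2N bound given above.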