The problem of string searching occurs in many applications. A string search algorithm looks for a string called a “pattern” within a larger input string called the “text.” Multiple string searching refers to searching for multiple such patterns in the text string in a single pass, rather than in one pass per pattern. In a string search, the text string is typically more than several million bits long, with the smallest unit being one octet in size. The start of a pattern string within the text is typically not known. A search method that can find patterns when their starting positions within the input string are not known in advance is known as unanchored searching. In an anchored search, the search algorithm is given the input string along with the offsets at which the pattern strings start.
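The distinction between anchored and unanchored searching can be illustrated with a minimal Python sketch. The function names `anchored_match` and `unanchored_search` are hypothetical, and the unanchored routine is a deliberately naive baseline that tests every pattern at every octet offset; it is not the single-pass multiple string search discussed in this section.

```python
def anchored_match(text: bytes, pattern: bytes, offset: int) -> bool:
    """Anchored search: the caller supplies the offset where the pattern must start."""
    return text.startswith(pattern, offset)


def unanchored_search(text: bytes, patterns: list[bytes]) -> list[tuple[int, bytes]]:
    """Unanchored search: the start is unknown, so every octet offset is tried.

    Naive illustration only: the cost grows with len(text) * len(patterns).
    """
    hits = []
    for offset in range(len(text)):
        for pattern in patterns:
            if text.startswith(pattern, offset):
                hits.append((offset, pattern))
    return hits


if __name__ == "__main__":
    payload = b"GET /cmd.exe HTTP/1.0"
    print(anchored_match(payload, b"GET", 0))
    print(unanchored_search(payload, [b"cmd.exe", b"HTTP"]))
```

The per-offset, per-pattern loop in the naive routine is the cost that a single-pass multiple string search is intended to avoid.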
A network system attack (also referred to herein as an intrusion) is usually defined as an unauthorized or malicious use of a computer or computer network. In some cases, a network system attack may involve hundreds to thousands of unprotected network nodes in a coordinated attack, which is launched against specific or random targets. These attacks may include break-in attempts, including but not limited to email viruses, corporate espionage, general destruction of data, and the hijacking of computers/servers to spread additional attacks. Even when a system cannot be directly broken into, denial of service attacks can be just as harmful to individuals and companies who stake their reputations on providing reliable services over the Internet. Because of increasing usage of and reliance upon network services, individuals and companies have become increasingly aware of the need to combat system attacks at every level of the network, from end hosts and network taps to edge and core routers.
Intrusion detection systems (IDSs) are emerging as one of the most promising ways of providing protection to systems on a network. Intrusion detection systems automatically monitor network traffic in real time, and can be used to alert network administrators to suspicious activity, keep logs to aid in forensics, and assist in the detection of new viruses and denial of service attacks. They can be found in end-user systems to monitor and protect against attacks from incoming traffic, or in network-tap devices that are inserted into key points of the network for diagnostic purposes. Intrusion detection systems may also be used in edge and core routers to protect the network infrastructure from distributed attacks.
Intrusion detection systems increase protection by identifying attacks with valid packet headers that pass through firewalls. Intrusion detection systems provide this capability by searching both packet headers and payloads (i.e., content) for known attack data sequences, referred to herein as “signatures,” and following prescribed actions in response to detecting a given signature. In general, the signatures and corresponding response actions supported by an intrusion detection system are referred to as a “rule-set database,” “IDS database” or simply “database.” Each rule in the database typically includes a specific set of information, such as the type of packet to search, a string of content to match (i.e., a signature), a location from which to start the search (e.g., for anchored searches), and an associated action to take if all conditions of the rule are matched. Different databases may include different sets of information, and therefore, may be tailored to particular network systems or types of attack.
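For illustration, a rule of the kind described above might be modeled as a small record holding the packet type, the signature, an optional start offset, and an action. The following Python sketch is an assumption made for clarity, not the layout of any particular IDS database; the names `Rule` and `apply_rules`, and the convention that a start offset of `None` denotes an unanchored search, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    packet_type: str             # type of packet to search, e.g. "tcp"
    signature: bytes             # string of content to match
    start_offset: Optional[int]  # assumed convention: None means unanchored
    action: str                  # action to take when all conditions match


def apply_rules(packet_type: str, payload: bytes, rules: list[Rule]) -> list[str]:
    """Return the actions of every rule whose conditions are all satisfied."""
    actions = []
    for rule in rules:
        if rule.packet_type != packet_type:
            continue
        if rule.start_offset is None:
            matched = rule.signature in payload  # unanchored: anywhere in payload
        else:
            matched = payload.startswith(rule.signature, rule.start_offset)
        if matched:
            actions.append(rule.action)
    return actions


if __name__ == "__main__":
    database = [
        Rule("tcp", b"cmd.exe", None, "alert"),
        Rule("tcp", b"GET", 0, "log"),
        Rule("udp", b"GET", 0, "drop"),
    ]
    print(apply_rules("tcp", b"GET /cmd.exe", database))
```

In this sketch each rule is checked independently against the payload; a string matching engine would instead search for all signatures together.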
At the heart of most modern intrusion detection systems is a string matching engine that compares the data arriving at the system to one or more signatures (e.g., strings or patterns) in the rule-set database and flags data containing an offending (e.g., matching) signature. As data is generally searched in real time in ever-faster network devices and rule databases continue to grow at a tremendous rate, string matching engines require rapidly increasing memory capacity and processing power to keep pace. Consequently, to avoid the escalating costs associated with ever-increasing hardware demands, designers have endeavored to improve the efficiency of the string matching methodology itself.
The classic Aho-Corasick algorithm builds a deterministic finite state automaton (DFA) that encodes all the strings to be searched. The DFA is typically constructed in two phases. In the first phase, a search tree such as a goto-failure state diagram embodying the patterns to be matched is constructed, with each string represented by a sequence of states extending from the root node and with goto transitions (e.g., success transitions) inserted between sequential states. Strings that share a common prefix also share a corresponding set of parent nodes in the tree, and the distance of a state from the root node in the tree is called the depth of that state. To match a string, the state machine starts at the root node and moves between states according to the goto transitions of the search tree. For example, if in a given state the input character matches a goto transition, the goto transition is taken to the next state in the string path and a cursor is incremented to point to the next character in the input string. Conversely, if the input character does not match any goto transition at a given state, which results in an edge failure, a failure transition may be taken.
In the second phase, next transitions are inserted to states that represent prefixes of the states that result in edge failure, so that instead of returning to the root node upon edge failure, the state machine can transition to another state that represents an accumulated prefix of the state that resulted in edge failure and advance the cursor to the next input character in the input string. The addition of next transitions thus improves performance by avoiding the re-examination of input characters upon edge failure.
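The two construction phases can be sketched in Python as follows. This is a minimal textbook-style illustration under stated assumptions, not the implementation of any particular device: the names `build_automaton` and `search` are hypothetical, and instead of precomputing every next transition the sketch stores one failure link per state and follows it at search time, which yields the same state sequence without moving the cursor backward.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[state][char] -> child state; state 0 is the root
    fail = [0]    # fail[state] -> state to fall back to on edge failure
    out = [[]]    # out[state] -> patterns recognized on reaching state

    # Phase 1: build the search tree; shared prefixes share parent nodes.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Phase 2: breadth-first pass, so shallower states are resolved first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, automaton):
    """Scan text once; each input character is examined at one cursor position."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]            # edge failure: fall back, keep cursor
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(search("ushers", automaton))
```

With the patterns "he", "she", "his" and "hers", the strings "he", "his" and "hers" share the state reached on "h", and an edge failure after "she" falls back to the state for "he" rather than to the root.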
State machines constructed using the Aho-Corasick technique are often implemented using an architecture that includes a ternary content addressable memory (TCAM), an associated memory such as SRAM, and control logic. Each TCAM entry represents a corresponding transition in the state machine, and is associated with a corresponding entry in the SRAM whose address is computed from the TCAM index. More specifically, the TCAM entries are logically partitioned into a current state field and an input character field (e.g., which contains a goto transition from the state), and the SRAM entries are partitioned into a next state field, a result field, and an action field. Although TCAM devices are effective in performing string search operations, storage limitations of TCAM devices undesirably limit the size and number of signatures of a search tree that can be stored in the TCAM device. Therefore, there is a need to more efficiently store the state entries of a search tree in TCAM devices.
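The TCAM/SRAM arrangement described above can be modeled in software for illustration. In the following Python sketch, every name (`TcamEntry`, `SramEntry`, `tcam_lookup`, `run`) is hypothetical, the SRAM address is assumed to equal the index of the first matching TCAM entry, and `None` stands in for a "don't care" field. The example table encodes a state machine for the single signature "ab"; the particular use of don't-care entries is an assumption made to keep the table small and is not taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TcamEntry:
    state: Optional[int]  # current state field; None models "don't care"
    char: Optional[str]   # input character field; None models "don't care"


@dataclass(frozen=True)
class SramEntry:
    next_state: int
    result: Optional[str]  # matched signature, if any
    action: Optional[str]  # action associated with the result


def tcam_lookup(tcam, state, char):
    """Return the index of the first (highest-priority) matching entry."""
    for index, entry in enumerate(tcam):
        if entry.state in (None, state) and entry.char in (None, char):
            return index
    raise LookupError("no TCAM entry matched")


def run(tcam, sram, text):
    """Drive the state machine: one TCAM lookup and one SRAM read per character."""
    state, events = 0, []
    for pos, ch in enumerate(text):
        entry = sram[tcam_lookup(tcam, state, ch)]  # assumed: address == index
        state = entry.next_state
        if entry.result is not None:
            events.append((pos, entry.result, entry.action))
    return events


# States: 0 = root, 1 = seen "a", 2 = seen "ab".
EXAMPLE_TCAM = [
    TcamEntry(1, "b"),        # goto transition from state 1 on "b"
    TcamEntry(None, "a"),     # from any state, "a" leads to state 1
    TcamEntry(None, None),    # default: return to the root
]
EXAMPLE_SRAM = [
    SramEntry(2, "ab", "alert"),
    SramEntry(1, None, None),
    SramEntry(0, None, None),
]

if __name__ == "__main__":
    print(run(EXAMPLE_TCAM, EXAMPLE_SRAM, "xaabab"))
```

Because each transition of the state machine occupies one TCAM entry in this model, the number of entries grows with the size and number of signatures, which is the storage pressure noted above.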
Like reference numerals refer to corresponding parts throughout the drawing figures.