The present invention is generally directed to pattern matching. More specifically, the present invention is directed to a method and system for multi-character multi-pattern pattern matching.
Pattern matching is used to detect the occurrence of a pattern or keyword in a search target. For example, given two strings, a simple pattern matching algorithm can detect the occurrence of one of the strings in the other. Pattern matching algorithms are widely used in information retrieval applications, such as data mining, bibliographic searching, search and replace text editing, and word processing, and in content inspection applications, such as Network Intrusion Detection Systems (NIDS), virus/worm detection using signature matching, IP address lookup in network routers, and DNA sequence matching.
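By way of illustration only, the simple two-string case described above may be sketched as follows. The function name find_occurrences and the example strings are hypothetical and are used here solely for explanation; the sketch tries every alignment of the pattern against the search target and is not intended to represent any particular conventional implementation.

```python
def find_occurrences(pattern, text):
    """Return every start index at which `pattern` occurs in `text`."""
    hits = []
    if not pattern:
        return hits  # an empty pattern is treated as having no occurrences
    # Try each possible alignment of the pattern against the search target.
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(find_occurrences("ana", "bananas"))
```

Such an approach handles one pattern at a time, so searching for many patterns in this manner would require one pass over the search target per pattern.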
For many applications, it is necessary to detect multiple patterns in a particular search target. A conventional multi-pattern pattern matching algorithm is the Aho-Corasick (AC) algorithm. The AC algorithm locates all occurrences of any of a finite number of keywords in a string by constructing a finite state pattern matching machine to process the string in a single pass. For example, this algorithm can be used to detect virus/worm signatures in a data packet stream by running the data packet stream through the finite state machine byte by byte.
The AC algorithm constructs the finite state machine in three pre-processing stages, namely the goto stage, the failure stage, and the next stage. In the goto stage, a deterministic finite state automaton (DFA) or keyword trie is constructed for a given pattern set. The DFA constructed in the goto stage includes various states for an input string, and transitions between the states based on characters of the input string. Each transition between states in the DFA is based on a single character of the input string. The failure and next stages add additional transitions between the states of the DFA to ensure that a string of length n can be searched in exactly n cycles. Essentially, when the next input character no longer extends the pattern currently being matched, these additional transitions allow the algorithm to slide to another pattern that is the next best match, that is, the pattern whose prefix is the longest suffix of the input processed so far. Once the pre-processing has been performed, the DFA can then be used to search any target for all of the patterns in the pattern set.
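A minimal sketch of the three pre-processing stages described above is given below for explanatory purposes. The function name build_ac_dfa, the list-of-dictionaries representation, and the explicit alphabet argument are assumptions made for this illustration and do not come from any particular implementation of the AC algorithm.

```python
from collections import deque


def build_ac_dfa(patterns, alphabet):
    """Return (delta, output) for the given pattern set (illustrative only)."""
    # Goto stage: build the keyword trie, one state per distinct prefix.
    goto = [{}]        # goto[state][char] -> next state
    output = [set()]   # output[state] -> patterns recognized at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # Failure stage: breadth-first pass. fail[s] is the state reached by the
    # longest proper suffix of s's prefix that is also a prefix of a pattern.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            output[child] |= output[fail[child]]  # inherit shorter matches
            queue.append(child)

    # Next stage: resolve failure chains ahead of time so that every
    # (state, character) pair has exactly one stored transition.
    delta = [{} for _ in goto]
    for state in range(len(goto)):
        for ch in alphabet:
            s = state
            while s and ch not in goto[s]:
                s = fail[s]
            delta[state][ch] = goto[s].get(ch, 0)
    return delta, output


if __name__ == "__main__":
    delta, output = build_ac_dfa(["he", "she", "his", "hers"], "ehirs")
    print(len(delta), sorted(output[5]), delta[5]["r"])
```

In this sketch, the state reached by "she" also reports "he", and on the character "r" it moves directly to the state for "her", which is the sliding behavior that the failure and next stages provide.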
During the search stage, the AC DFA processes one character (or byte) per transition in the DFA, and each transition is stored in a memory. Accordingly, the AC DFA transitions to a different state based on each character of the input string. Hence, for each character in an input string, a memory lookup operation must be performed to access the transitions from the current state of the AC DFA and compare the transitions to the character.
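The per-character cost of the search stage may be illustrated with the following sketch, which counts one table access for each character of the input string. The names search, DELTA, and OUTPUT are hypothetical, and the small transition table is written by hand for the single pattern "ab" under the assumption that any character not listed for a state returns the machine to state 0.

```python
def search(delta, output, text, default_state=0):
    """Run `text` through the DFA; return (matches, number of table lookups)."""
    state = 0
    lookups = 0
    matches = []
    for pos, ch in enumerate(text):
        # One lookup of the stored transitions per input character.
        state = delta[state].get(ch, default_state)
        lookups += 1
        for pat in output.get(state, ()):
            matches.append((pos - len(pat) + 1, pat))
    return matches, lookups


# Hand-written DFA for the single pattern "ab" (assumption: unlisted
# characters lead back to state 0).
DELTA = {0: {"a": 1}, 1: {"a": 1, "b": 2}, 2: {"a": 1}}
OUTPUT = {2: ["ab"]}

if __name__ == "__main__":
    print(search(DELTA, OUTPUT, "aabab"))
```

For the five-character input in this example, the sketch performs five lookups, one per character, regardless of how many matches are found.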
Virus/worm detection applications must detect the presence of multiple virus/worm signatures in a stream of packets in a single pass as the packets are transmitted in a data network. As network speeds have increased, conventional pattern matching algorithms, such as the AC algorithm, cannot perform at high enough speeds to keep up with the network speeds. One reason that conventional pattern matching algorithms cannot perform at high speeds is that a memory lookup operation must be performed for each byte of the stream of packets.
What is needed is a multi-pattern matching method that can be used for high speed applications.