Identifying patterns or strings of interest in network packet streams has practical uses in fields as diverse as databases, network security, and computer vision or imaging. Many functions for performing single or multiple pattern matching are known. The worst-case performance of the more efficient of these functions is linear in the amount of data being matched against a pattern. For example, the search times of finite state automaton (FSA) based functions, such as Knuth-Morris-Pratt and Aho-Corasick, are generally linear in the data size and at the same time independent of the number of target signature strings to match against. However, these techniques generally require a large amount of system memory, such as random access memory (RAM), to store the FSA. Such memory-based schemes also reduce the rate at which patterns can be matched because of frequent memory fetch operations, making the schemes infeasible for gigabit-speed routers. For high-speed operation, hardware alternatives using field programmable gate arrays (FPGAs) exist, but these schemes are inflexible when the target patterns change frequently. In addition, the hardware cost of implementing the matching function using FPGAs is high.
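By way of illustration only, the FSA-based approach can be sketched with a minimal Aho-Corasick matcher. The function names `build_automaton` and `search` and the list-of-dictionaries state layout are assumptions made for this sketch, not part of any particular known implementation. The scan makes a single pass over the text, while the number of stored states grows with the total length of the signature strings, which is the memory cost noted above.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto = [{}]   # goto[s] maps a character to the next state; state 0 is the root
    fail = [0]    # fail[s] is the state to fall back to on a mismatch
    out = [[]]    # out[s] lists the patterns that end at state s
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal so that shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (proper suffixes).
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) pairs found in one pass over text."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))
print("states stored:", len(automaton[0]))
```

In this sketch every input character causes one or more table lookups, which corresponds to the frequent memory fetch operations described above.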
Hash-based schemes offer a good alternative to FSA-based implementations because they can efficiently store the list of signature strings in a small amount of memory. However, known hash-based schemes suffer from the inherent disadvantage of false positives, i.e., the hash values of distinct patterns may be equal. Furthermore, known hash-based schemes cannot be used to identify long patterns spread across multiple packets, because these schemes cannot store the state of the data when data that may include a large target pattern is split among several packets.
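Both limitations can be illustrated with a small sketch. The hash function `weak_hash` below is deliberately weak (a sum of byte values) so that a collision is easy to produce, and `scan`, the bucket count, and the assumption that all signatures have the same length are hypothetical choices made for illustration, not features of any particular known scheme.

```python
def weak_hash(s, buckets=251):
    # Deliberately weak hash (sum of byte values) so collisions are easy to see.
    return sum(s.encode()) % buckets


def scan(payload, signatures):
    """Return (candidates, confirmed) start offsets within one payload.

    Assumes, for simplicity, that all signatures share one length.
    """
    n = len(next(iter(signatures)))
    table = {weak_hash(sig) for sig in signatures}  # only hash values are stored
    candidates, confirmed = [], []
    for i in range(len(payload) - n + 1):
        window = payload[i:i + n]
        if weak_hash(window) in table:
            candidates.append(i)          # hash hit: may be a false positive
            if window in signatures:
                confirmed.append(i)       # exact comparison removes false positives
    return candidates, confirmed


signatures = {"abc"}

# "cab" hashes to the same value as "abc", so offset 2 is a false positive.
print(scan("xxcabxxabc", signatures))

# The same signature split across two packets is missed when each packet
# is scanned independently and no state is carried between them.
print(scan("xxab", signatures), scan("cxx", signatures))
```

The first call reports two hash hits but only one confirmed match, and the second pair of calls reports nothing even though the concatenated payload contains the signature.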