The invention relates to pattern matching.
One type of pattern matcher checks whether a sequence of symbols (e.g., represented by data items such as bytes) matches any of a set of predetermined patterns. Pattern matching can be used for virus scanning, network intrusion detection, spam filtering and many other operations whose computing demands grow as Internet traffic grows. Matching patterns against a high-bandwidth data stream, such as a 10 gigabit per second Ethernet connection, can be challenging because the high data rate leaves relatively few CPU cycles per input byte for examining the data. Complex patterns such as regular expressions can place further demands on the pattern matcher.
Some techniques to match patterns efficiently use pattern-matching finite state machines (PFSMs). A PFSM has associated states, and an input may transition the PFSM from a current state to another state, or an input may cause the PFSM to remain in the current state. Each portion of an input data stream can be considered an input that may potentially cause a transition between PFSM states.
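By way of illustration only, the state-transition behavior described above can be sketched as a small table-driven machine. The single pattern b"ab", the state names, and the functions step and find_matches are hypothetical choices made for this sketch and are not taken from any particular implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical states for a PFSM that recognizes the single pattern b"ab".
START, SEEN_A, MATCHED = 0, 1, 2

# Transition table keyed by (current state, input byte).
# Assumption for this sketch: any pair not listed falls back to START.
TRANSITIONS: Dict[Tuple[int, int], int] = {
    (START, ord("a")): SEEN_A,
    (SEEN_A, ord("a")): SEEN_A,   # this input leaves the PFSM in its current state
    (SEEN_A, ord("b")): MATCHED,
    (MATCHED, ord("a")): SEEN_A,
}


def step(state: int, byte: int) -> int:
    """Return the next state for one input byte."""
    return TRANSITIONS.get((state, byte), START)


def find_matches(data: bytes) -> List[int]:
    """Feed each byte to the PFSM and return the end offsets of matches."""
    state = START
    ends: List[int] = []
    for offset, byte in enumerate(data):
        state = step(state, byte)
        if state == MATCHED:
            ends.append(offset + 1)  # exclusive end offset of the match
    return ends
```

Each byte costs one table lookup here, which is the property that makes such machines attractive when few CPU cycles are available per input byte.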
A set of patterns can be compiled offline into one or more PFSMs that are fed portions (e.g., bytes or characters) of a data stream (in some cases, in real time) and report matches as they occur. In principle, any number of patterns can be combined into a single PFSM that uses a given number of cycles of processing time per input byte on average. Unfortunately, the memory size of a representation of this type of PFSM can grow exponentially with the number of patterns it is trying to match. Even though main memory sizes can be large, memory constraints may still be a factor because PFSM performance can improve dramatically when its representation fits in a processor's memory cache.
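The exponential growth noted above can be seen in a toy case. The following sketch assumes a hypothetical family of k patterns, where pattern i means "symbol a_i, later followed by symbol b_i". Each pattern alone needs three states, but a single combined machine must track the progress of every pattern at once. The pattern family and the names step_component and count_combined_states are illustrative assumptions, not part of any described system.

```python
from collections import deque
from typing import Tuple

Symbol = Tuple[str, int]  # ("a", i) or ("b", i), an assumed toy alphabet


def step_component(state: int, symbol: Symbol, index: int) -> int:
    """Advance the 3-state machine for pattern `index` on one symbol."""
    kind, i = symbol
    if i != index:
        return state      # symbols of other patterns are ignored
    if state == 0 and kind == "a":
        return 1          # a_i has been seen
    if state == 1 and kind == "b":
        return 2          # b_i followed a_i: pattern matched
    return state


def count_combined_states(k: int) -> int:
    """Count reachable states of one machine tracking all k patterns."""
    alphabet = [(kind, i) for i in range(k) for kind in ("a", "b")]
    start = (0,) * k
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            nxt = tuple(
                step_component(s, symbol, idx) for idx, s in enumerate(current)
            )
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)
```

For this toy family, k separate machines need 3k states in total, whereas the combined machine reaches 3^k states (for example, 27 states for k = 3 versus 9 across three separate machines). This is the kind of trade-off that can push a combined representation out of a processor's cache.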