1. Field of the Invention
Embodiments of the present invention relate to methods and systems for efficiently searching for patterns in a character stream. More particularly, embodiments of the present invention relate to systems and methods for optimizing and reducing the memory requirements of state machine algorithms in pattern matching applications.
2. Background Information
An intrusion detection system (IDS) examines network traffic packets. An IDS searches for intruders by looking for specific values in the headers of packets and by performing a search for known patterns in the application data layer of packets. Typical IDSs rely heavily on the Aho-Corasick multi-pattern search engine. As a result, the performance characteristics of the Aho-Corasick algorithm have a significant impact on the overall performance of these IDSs.
The pattern search problem in IDSs is a specialized problem that requires consideration of many issues. The real-time nature of inspecting network packets requires a pattern search engine that can keep up with the speeds of modern networks. Two types of data are used in this type of search: the patterns and the search text. The patterns are pre-defined and static, which allows them to be pre-processed into the form best suited to any given pattern matching algorithm. The search text changes dynamically as each network packet is received, which prohibits pre-processing of the search text prior to performing the pattern search. This type of pattern search problem is defined as a serial pattern search. The Aho-Corasick algorithm is a classic serial search algorithm and was first introduced in 1975. Many variations have been inspired by the Aho-Corasick algorithm, and more recently IDSs have generated renewed interest in the algorithm due to some of its unique properties.
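For illustration only, the serial search described above can be sketched in Python as a minimal textbook form of the Aho-Corasick algorithm. The static patterns are pre-processed once into a state machine, and each search text is then scanned in a single pass. The function names build_automaton and search, and the dictionary-based state storage, are assumptions made for this sketch and do not describe any particular IDS implementation.

```python
from collections import deque

def build_automaton(patterns):
    """Pre-process static byte patterns into goto, failure, and output tables."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(pat)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for b, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and b not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(b, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search(text, automaton):
    """Scan the search text once and return (start offset, pattern) pairs."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, b in enumerate(text):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

In this sketch, build_automaton would be called once per pattern group, while search would be called for every packet payload. The scan loop examines every byte of the search text exactly once and never skips ahead.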
IDSs search for patterns that can be either case sensitive or case insensitive. The Aho-Corasick algorithm was originally used as a case sensitive pattern search engine. The size of the patterns used in a search can significantly affect the performance characteristics of the search algorithm. State machine based algorithms, such as the Aho-Corasick algorithm, are not affected by the size of the smallest or largest pattern in a group. Skip-based methods, such as the Wu-Manber algorithm and others that utilize character skipping features, are very sensitive to the size of the smallest pattern. The ability to skip portions of a search text can greatly accelerate the search engine's performance. However, skip distance is limited by the smallest pattern. Search patterns in IDSs represent portions of known attack patterns and can vary in size from one to thirty or more characters but are usually small.
The pattern group size usually affects IDS performance because the IDS pattern search problem typically benefits from processor memory caching. Small pattern groups can fit within the cache and benefit most from a high performance cache. As pattern groups grow larger, less of the pattern group fits in the cache, and there are more cache misses. This reduces performance. Most search algorithms will perform faster with ten patterns than they do with one thousand patterns, for instance. The rate at which performance degrades as pattern group size increases varies from algorithm to algorithm. It is desirable that this degradation be sub-linear in order to maintain scalability.
The alphabet size used in current IDSs is defined by the size of a byte. An eight-bit byte value is in the range 0 to 255, providing IDSs with a 256-character alphabet. These byte values represent the ASCII and control characters seen on standard computer keyboards and other non-printable values. For instance, the letter ‘A’ has a byte value of sixty-five. Extremely large alphabets such as Unicode can be represented using pairs of byte values, so the alphabet the pattern search engine deals with is still 256 characters. This is a large alphabet by pattern matching standards. The English dictionary alphabet is fifty-two characters for upper and lower case characters, and DNA research uses a four-character alphabet in gene sequencing. The size of the alphabet has a significant impact on which search algorithms are the most efficient and the quickest.
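The effect of a 256-character alphabet can be illustrated with a short Python sketch. It assumes, purely for illustration, a state machine stored as a dense table with one entry per state and byte value. The function name full_table_bytes, the 4-byte entry size, and the 10,000-state figure are hypothetical and are not taken from any measured IDS.

```python
ALPHABET_SIZE = 256  # one symbol per possible eight-bit byte value

def full_table_bytes(num_states, entry_bytes=4):
    """Memory for a dense table: one entry per (state, byte value) pair.

    The dense layout and the 4-byte entry size are assumptions for illustration.
    """
    return num_states * ALPHABET_SIZE * entry_bytes

letter_a = ord("A")                               # byte value of the letter 'A'
omega_bytes = list("\u03a9".encode("utf-16-le"))  # one Unicode character as a pair of byte values
table_bytes = full_table_bytes(10_000)            # hypothetical 10,000-state machine
```

Under these assumptions, every state carries 256 entries regardless of how many of them are actually used, so the table size grows in direct proportion to the alphabet size.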
Algorithmic attacks by intruders attempt to use the properties and behaviors of a search algorithm against it to reduce the search algorithm's performance. The performance behavior of a search algorithm should be evaluated by considering the search algorithm's average-case and worst-case performance. Search algorithms that exhibit significantly different worst-case and average-case performance are susceptible to algorithmic attacks by intruders. Skip-based search algorithms, such as the Wu-Manber algorithm, utilize a character skipping feature similar to the bad character shift in the Boyer-Moore algorithm. Skip-based search algorithms are sensitive to the size of the smallest pattern since they can be shown to be limited to skip sizes smaller than the smallest pattern. A pattern group with a single-byte pattern cannot skip over even one character or it might not find the single-byte pattern.
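The dependence of skip distance on the smallest pattern can be illustrated with a simplified bad character shift table in Python. This is a sketch under stated assumptions: it indexes the table by single bytes, whereas the Wu-Manber algorithm typically hashes blocks of two or more characters, and the function name build_shift_table is hypothetical.

```python
def build_shift_table(patterns):
    """Build a single-byte bad character shift table for a group of byte patterns.

    Returns (m, shift), where m is the length of the smallest pattern and
    shift[b] is how far the search window may advance when byte b is the
    last byte in the window. A shift of 0 means a match is possible here
    and the candidate patterns must be checked.
    """
    m = min(len(p) for p in patterns)  # only the first m bytes of each pattern are usable
    shift = [m] * 256                  # default for bytes that appear in no pattern prefix
    for p in patterns:
        for j in range(m):
            shift[p[j]] = min(shift[p[j]], m - 1 - j)
    return m, shift

# Two long patterns: the window can advance up to six bytes at a time.
m_long, shift_long = build_shift_table([b"attack", b"exploit"])

# Adding a single-byte pattern: the window can advance only one byte at a time.
m_short, shift_short = build_shift_table([b"attack", b"exploit", b"x"])
```

In this sketch no table entry can exceed the length of the smallest pattern. Once a single-byte pattern joins the group, the window advances at most one position at a time, so every byte of the search text is examined and nothing is skipped.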
The performance of the Wu-Manber algorithm can be significantly degraded by malicious network traffic, resulting in a Denial of Service attack. This type of attack requires degenerate traffic of small repeated patterns. This problem does not exist for text searches, where algorithmic attacks are not intentional and are usually very rare. It is a significant issue in IDSs, where algorithmic attacks are prevalent and intentional. The Wu-Manber algorithm also does not achieve its best performance levels in a typical IDS, because IDS rules often include very small search patterns of one to three characters, eliminating most of the opportunities to skip sequences of characters.
In practice, an IDS using the Wu-Manber algorithm provides little performance improvement over an IDS using the Aho-Corasick algorithm in the average case. However, an IDS using the Wu-Manber algorithm can be significantly slower than an IDS using the Aho-Corasick algorithm when experiencing an algorithmically designed attack. The strength of skip-based algorithms is evident when all of the patterns in a group are large. In this case, the skip-based algorithms can skip many characters at a time and are among the fastest average-case search algorithms available. The Aho-Corasick algorithm is unaffected by small patterns. The Aho-Corasick algorithm's worst-case and average-case performance are the same. This makes it a very robust algorithm for IDSs.
The size of the search text read by an IDS is usually less than a few thousand bytes. In general, when searching text, the expense of setting up the search pattern and filling the cache with the necessary search pattern information is usually fixed. If the search is performed over just a few bytes, then spreading the setup costs over those few bytes results in a high cost per byte. If the search text is very large, however, spreading the setup cost over the larger search text adds very little overhead to the cost of searching each byte of text.
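The amortization of setup cost over the search text can be made concrete with a short Python sketch. The cost model, the function name cost_per_byte, and the numbers used are hypothetical and serve only to illustrate the arithmetic. They are not measurements.

```python
def cost_per_byte(setup_cost, scan_cost_per_byte, text_len):
    """Average cost per byte when a fixed setup cost is spread over text_len bytes."""
    return setup_cost / text_len + scan_cost_per_byte

# Hypothetical figures: 1000 cost units of setup and 1 cost unit to scan each byte.
small_text = cost_per_byte(1000, 1, 100)      # short search text, such as one packet
large_text = cost_per_byte(1000, 1, 100_000)  # large search text, such as a file
```

With these assumed figures, the short search text costs about eleven units per byte, while the large search text costs only slightly more than the one unit per byte needed for scanning.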
The frequency of searching in an IDS is dependent on the network bandwidth, the volume of traffic on the network, and the size of the network packets. This implies that the frequency of pattern searches and the size of each search text are related due to the nature of the network traffic being searched. Again, as with search text size, a high frequency of searching in an IDS will cause the search setup costs to be significant compared to performing fewer, larger text searches.
In view of the foregoing, it can be appreciated that a substantial need exists for methods and systems that take into consideration the many issues associated with pattern searching in IDSs and improve the overall performance of IDSs.