The problem of string searching occurs in many applications. A string search algorithm looks for a string called a “pattern” within a larger input string called the “text.” Multiple string searching refers to searching for multiple such patterns in the text string without having to make multiple passes over the text. In a string search, the text string is typically several thousand bits long, with the smallest unit being one octet in size. The start of a pattern string within the text is typically not known. A search method that can find patterns when their starting positions within the argument text are not known in advance is known as unanchored searching. In an anchored search, by contrast, the search algorithm is given the text along with the offsets at which the strings start.
A generalized multiple string search is utilized in many applications such as URL based switching, Web caching, XML parsing, text compression and decompression, analyzing DNA sequences in the study of genetics, and intrusion detection systems for the internet. In string searching applications, an argument text is presented to the string search engine, which then searches this text for occurrences of each of multiple patterns residing in a database, as illustrated in FIG. 1. If a match is found, then an index or code that uniquely identifies the matching pattern entry in the database is returned along with a pointer (offset) to the matching position in the input text string. The pointer indicates the number of character positions by which the start of the matching pattern is offset from the starting character of the input text string.
For example, consider the input text string: “We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.” Assume that the pattern “that” is stored in the pattern database as a first pattern (Pattern 1) and the pattern “are” is stored in the pattern database as a second pattern (Pattern 2). For the two pattern strings “that” and “are,” a string search engine utilizing a matching algorithm may output a result of Offset 41/Pattern 1 because the pattern “that” was found in the database and the first character “t” of this occurrence of “that” is offset 41 places from the starting character “W” of the input text string. The other results, for example, would be as follows: Offset 54/Pattern 2; Offset 73/Pattern 1; Offset 83/Pattern 2; Offset 145/Pattern 1; Offset 162/Pattern 2.
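The offsets in this example can be checked with a short sketch. It is only an illustration of the expected output, not the search engine described in this document: the function name find_patterns and the zero-based offset convention (the starting character “W” is at offset 0) are assumptions, and the scan is a naive unanchored one that tests every pattern at every offset during a single pass over the text.

```python
def find_patterns(text, patterns):
    """Return (offset, pattern_number) pairs for every occurrence.

    Unanchored: every offset of the text is a candidate start position.
    Pattern numbers start at 1 to match "Pattern 1", "Pattern 2".
    """
    results = []
    for offset in range(len(text)):
        for number, pattern in enumerate(patterns, start=1):
            # str.startswith with a start index compares in place, no slicing
            if text.startswith(pattern, offset):
                results.append((offset, number))
    return results


TEXT = (
    "We hold these truths to be self-evident, that all men are created equal, "
    "that they are endowed by their Creator with certain unalienable Rights, "
    "that among these are Life, Liberty and the pursuit of Happiness."
)

if __name__ == "__main__":
    for offset, number in find_patterns(TEXT, ["that", "are"]):
        print(f"Offset {offset}/Pattern {number}")
```

The cost of this naive scan grows with the number of patterns, since every pattern is tested at every offset.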
Some prior string search engines are based on software algorithms such as Boyer-Moore that are inherently slow and have limited throughput. Other prior string search engines utilize the Aho-Corasick algorithm for string matching, in which either a static random access memory (SRAM) or content addressable memory (CAM) based lookup table is used to implement state transitions in the string search engine. One problem with prior string search engines utilizing the Aho-Corasick algorithm, such as disclosed in U.S. Pat. No. 5,278,981, is that they are incapable of performing wildcard or inexact matching. While some prior methods, such as disclosed in U.S. Pat. No. 5,452,451, are capable of performing wildcard matching, the inexact matching feature is limited only to prefixes in text strings. Moreover, such prior methods are only capable of anchored searches, in which the start of patterns within the incoming text string must be known and identified to the search engine. Further, such prior methods are not capable of the case insensitive matching that is required in many applications. In addition, for a given pattern database, such prior methods require a large number of entries in a CAM device. Finally, the prior methods are not capable of increasing the search speed by processing multiple octets from the text string concurrently.
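For orientation, the table-driven state transitions of the Aho-Corasick algorithm mentioned above can be sketched in software as follows. This is a minimal textbook-style sketch, not the implementation of either cited patent: Python dictionaries stand in for the SRAM or CAM lookup table, and the names build_automaton and search are hypothetical. Like the prior approaches described above, it performs exact matching only, with no wildcards and no case insensitivity, and it consumes one character per step.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]   # goto[state][char] -> next state (the transition table)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> pattern numbers (1-based) ending here
    for number, pattern in enumerate(patterns, start=1):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(number)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Unanchored multi-pattern search in a single pass over the text."""
    goto, fail, out = build_automaton(patterns)
    state, results = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for number in out[state]:
            start = pos - len(patterns[number - 1]) + 1
            results.append((start, number))
    return results


if __name__ == "__main__":
    print(search("that they are", ["that", "are"]))
```

Each input character causes one table lookup plus any failure-link fallbacks, so the text is scanned once regardless of how many patterns are stored.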
Like reference numerals refer to corresponding parts throughout the drawing figures.