Regular expression search operations are employed in various applications including, for example, intrusion detection systems (IDS), virus protection, policy-based routing functions, internet and text search operations, document comparisons, and so on, to locate one or more patterns in an input string of characters. A regular expression can simply be a word, a phrase, or a string of characters. For example, a regular expression including the string “gauss” would match data containing gauss, gaussian, degauss, etc. More complex regular expressions include metacharacters that provide certain rules for performing the match. Some common metacharacters are the wildcard “.”, the alternation symbol “|”, and the character class symbol “[ ].” Regular expressions can also include quantifiers such as “*” to match 0 or more times, “+” to match 1 or more times, “?” to match 0 or 1 times, {n} to match exactly n times, {n,} to match at least n times, and {n,m} to match at least n times but no more than m times. For example, the regular expression “a.{2}b” will match any input string that includes the character “a” followed by exactly 2 instances of any character followed by the character “b” including, for example, the input strings “abbb,” “adgb,” “a7yb,” “aaab,” and so on.
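The matching behavior described above can be verified with any conventional regular expression engine; the following sketch uses Python's `re` module (chosen here only for illustration; the original text does not specify any particular implementation) to confirm the “gauss” and “a.{2}b” examples.

```python
import re

# The "a.{2}b" example: "a", followed by exactly 2 of any character,
# followed by "b". Each example string from the text contains a match.
pattern = re.compile(r"a.{2}b")
for s in ["abbb", "adgb", "a7yb", "aaab"]:
    assert pattern.search(s) is not None  # match found somewhere in s

# The "gauss" example: a plain-string expression matches any input
# that contains it as a substring.
gauss = re.compile("gauss")
for s in ["gauss", "gaussian", "degauss"]:
    assert gauss.search(s) is not None
```

Note that `re.search` locates the pattern anywhere within the input string, which mirrors the "includes" phrasing of the example; anchored matching (e.g., `re.fullmatch`) would behave differently.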
Content search systems typically include a rules database that stores a plurality of rules (e.g., patterns or regular expressions), and one or more search engines that compare the input string with the rules to determine what action should be taken. For example, when determining whether to allow a user access to a particular website using an HTTP protocol, the packet payload containing an HTTP request for the site is typically provided to the search engine, and the payload is then searched against all the rules in the database to determine whether any matching rules are found. If a matching rule is found (e.g., indicating that the requested website is on a list of restricted sites), then the search system generates a result that indicates an appropriate action to be taken (e.g., deny access to the website).
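The rules-database lookup described above can be sketched as follows. This is a simplified illustration only; the rule names, patterns, hostname, and actions are hypothetical assumptions and do not appear in the original text, which describes hardware search engines rather than any particular software implementation.

```python
import re

# Hypothetical rules database: each rule pairs a compiled pattern with
# the action to take when the pattern matches the payload.
RULES = [
    ("restricted-site", re.compile(r"Host:\s*blocked\.example\.com"), "deny"),
]

def search_payload(payload: str) -> str:
    """Compare the payload against every rule in the database and
    return the action associated with the first matching rule."""
    for name, pattern, action in RULES:
        if pattern.search(payload):
            return action
    return "allow"  # no restricting rule matched

# An HTTP request payload for a hypothetical restricted site.
request = "GET / HTTP/1.1\r\nHost: blocked.example.com\r\n\r\n"
print(search_payload(request))  # prints "deny"
```

Note that this sketch scans the entire payload against every rule, which is precisely the inefficiency the following paragraphs describe: most of the payload (e.g., the request line and body) is irrelevant to the host-based restriction being enforced.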
For the above example, conventional search systems typically search large portions (if not all) of the payload for all the rules stored in the database to selectively restrict access, even though (1) only small portions of the payload are relevant to the particular search operation and/or (2) not all of the rules stored in the database directly pertain to the particular search operation. For the former, searching non-relevant portions of the payload can result not only in false matches (e.g., in which a field of the payload unrelated to restricting access matches one of the rules) but also in unnecessarily long search times. For the latter, loading and/or searching an input string for rules that are not relevant to the particular search operation undesirably consumes more search engine resources than necessary.
As the number of users and websites on a network (e.g., the Internet) increases, so does the number of rules to be searched by various network components. Indeed, the size and complexity of rules being searched in various networking applications continue to grow at a pace that increasingly strains the storage and search speed capacities of existing search engines. Thus, there is a need for search systems that can perform regular expression search operations more efficiently, using fewer resources.
Like reference numerals refer to corresponding parts throughout the drawing figures.