1. Field of the Invention
This invention generally relates to pattern recognition of character strings using regular expressions, and more particularly relates to methods and engines for searching character strings for patterns and determining the start of a matching pattern.
2. Description of the Prior Art
Regular expressions are formulae used for matching character strings that follow some pattern. They are made up of normal characters, such as upper and lower case letters and numbers, and “metacharacters”, which are symbols, such as / * | [ ], or the like, that have special meanings. Regular expressions are well known in the art, and for a more complete explanation of what they are and how they are used in pattern matching, reference should be made to Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly and Associates, Inc., the disclosure of which is incorporated herein by reference.
Two different regular expression (“regex”) engines commonly used for searching for patterns in a character string are a non-deterministic finite state automaton (NFA) and a deterministic finite state automaton (DFA). Again, reference should be made to the aforementioned publication, Mastering Regular Expressions, for a more complete explanation of how an NFA and DFA function.
FIG. 1 illustrates one conventional pattern matching scheme using either an NFA or a DFA. In this example, the pattern to be matched is expressed as the regex (a*|b)x. The character string being sampled is eight characters long, for this particular illustrative example.
In the example shown in FIG. 1, the first step (Step 1) in this conventional method of pattern matching is where the pattern is anchored at the first character in the string, which is character no. 0 and which is the character “a”. The matcher (i.e., the NFA or DFA) consumes characters until it reaches a failure state, which for the first step (Step 1) in the method occurs at character no. 6 in the string (which is the lower case letter “b”). In the example, it should be noted that “m” represents a character that has been successfully matched, “f” represents that the match has failed, and “M” represents that the match of the pattern as a whole is successful.
In the second step (Step 2) of this method of pattern matching, the pattern is now anchored at the second character in the string (i.e., character no. 1), which is also the lower case letter “a”. The pattern begins matching at character no. 1 and, again, fails at character no. 6 (i.e., the seventh character in the string), which is the lower case letter “b”. Thus, it should be noted that the pattern matcher (i.e., the NFA or DFA), in Step 2, has now gone over six characters that have already been considered in Step 1 of the pattern matching method. Thus, for a character string of eight characters, and for the given pattern of /(a*|b)x/, expressed as a regex, 29 characters must be considered before a match is found. As shown in FIG. 1, the match occurs in Step 7, where the pattern is anchored at character no. 6.
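The character count in this walkthrough can be reproduced with a minimal sketch. It assumes the sampled string is "aaaaaabx", which is consistent with the description of FIG. 1 but is not stated explicitly in the text, and it uses a small hand-built DFA for (a*|b)x. The state names and the function names match_at and naive_search are illustrative only.

```python
# Hand-built DFA for the regex (a*|b)x.
# State names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("start", "a"): "in_a",
    ("start", "b"): "after_b",
    ("start", "x"): "accept",
    ("in_a", "a"): "in_a",
    ("in_a", "x"): "accept",
    ("after_b", "x"): "accept",
}


def match_at(text, anchor):
    """Run the DFA from position `anchor`.

    Returns (matched, examined), where `examined` counts every character
    looked at, including the one that causes a failure.
    """
    state = "start"
    examined = 0
    for ch in text[anchor:]:
        examined += 1
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False, examined
        if state == "accept":
            return True, examined
    return False, examined


def naive_search(text):
    """Re-anchor the pattern at each position in turn, as in FIG. 1.

    Returns (start, end, total_examined); `end` is exclusive.
    """
    total = 0
    for anchor in range(len(text)):
        matched, examined = match_at(text, anchor)
        total += examined
        if matched:
            return anchor, anchor + examined, total
    return None, None, total


print(naive_search("aaaaaabx"))  # (6, 8, 29)
```

Under the assumed input, the seven anchored attempts examine 7, 6, 5, 4, 3, 2 and 2 characters, for a total of 29. On a string with no match at all, such as eight repetitions of "a", the same routine examines 8+7+...+1 = 36 characters, which illustrates the quadratic growth of this scheme.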
The advantage of this scheme is that the start and the end of the match are known. The disadvantage is that, in the worst-case situation, n² characters must be considered, where n is the length of the input string. Thus, if m patterns are to be considered simultaneously using this conventional method, and a separate pass is made on the input string for each pattern, the total number of comparisons performed is m×n².
Another method of pattern matching using regular expressions is described below. If, for example, there were two patterns, one of which is expressed by the regex /(a*|b)x/, as in the example given above and shown in FIG. 1, and the other pattern is the regex /pqr/, the two patterns may be combined into the following pattern: /.*(a*|b)x|.*pqr/
This particular pattern will succeed only if either of the original patterns succeeds (i.e., is matched), and the end of the match for this combined pattern will occur in the same place as if the original patterns were searched individually. What is more, the pattern matcher will find the match after considering at most n characters, since the pattern is anchored at the first character and will run from there.
The problem, however, with this second pattern matching scheme is that it is unclear where the start of match occurs. (The end of the match is known, as the matcher knows the character number when a terminal or accepting state is reached.)
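The loss of the start position can be illustrated with a short sketch using Python's standard re module. This is an illustration only: re is a backtracking engine rather than the DFA contemplated above, and with a greedy ".*" it reports the last possible end of match, so it is used here only on inputs with a single possible end. The input strings and the function names combined_span and individual_spans are assumptions made for this example.

```python
import re

# The combined pattern from the text, anchored at the first character
# by using re.match().
COMBINED = re.compile(r".*(a*|b)x|.*pqr")

# The two original patterns, searched individually for comparison.
ORIGINALS = [re.compile(r"(a*|b)x"), re.compile(r"pqr")]


def combined_span(text):
    """Return (start, end) reported for the combined, anchored pattern."""
    m = COMBINED.match(text)
    return m.span() if m else None


def individual_spans(text):
    """Return the (start, end) of each original pattern, or None."""
    spans = []
    for pattern in ORIGINALS:
        m = pattern.search(text)
        spans.append(m.span() if m else None)
    return spans


print(combined_span("aaaaaabx"))     # (0, 8)
print(individual_spans("aaaaaabx"))  # [(6, 8), None]
print(combined_span("zzpqrzz"))      # (0, 5)
print(individual_spans("zzpqrzz"))   # [None, (2, 5)]
```

In both inputs the end of the match agrees with the individual search, but the combined pattern always reports a start of 0, because the leading ".*" absorbs everything before the true start of the match.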