Pattern matching algorithms provide for the identification of locations of occurrences of one or more patterns of symbols, such as characters or tokens from a symbol alphabet, within a symbol sequence. A specific type of pattern matching is string matching whereby locations of occurrences of one or more strings are identified within a larger string. Pattern matching finds applications in text searching such as bibliographic searching, DNA and protein sequence analysis, data mining, security systems such as intrusion detection systems, anti-virus software and machine learning problems.
An approach to string matching is described by Alfred Aho and Margaret Corasick in their 1975 paper “Efficient String Matching: An Aid to Bibliographic Search”. The paper proposes a technique, known as the Aho-Corasick approach, that involves the construction of a non-deterministic finite-state automaton as a pattern matching machine from search patterns (keywords). Each state of the automaton corresponds to a partial or complete sequence of symbols of a search pattern. The pattern matching machine is used to process a text string in a single pass to identify occurrences of search patterns in the text string. The Aho-Corasick approach employs a “goto” function and a “failure” function. The goto function maps a pair, consisting of a current state of the automaton and an input symbol from the text string, to a state or a “fail” condition. Thus the goto function effectively provides directed transitions between states in the automaton. The failure function is responsive to the fail condition of the goto function and maps a current state of the automaton to a new state. The new state is identified as the state of the automaton corresponding to the longest proper suffix of the pattern symbol sequence of the mapped state, where such a new state exists. If such a new state does not exist in the automaton, the failure function maps to a starting state of the automaton.
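By way of illustration only, the goto and failure functions described above can be sketched in Python as follows. This is a minimal sketch and not the implementation of the 1975 paper: the names `build_machine` and `search`, the use of state 0 as the starting state, and the representation of the fail condition as a missing dictionary entry are all assumptions made for the example.

```python
from collections import deque


def build_machine(keywords):
    """Build goto, failure and output tables for a set of keywords.

    State 0 is assumed to be the starting state. A symbol missing from
    goto[state] stands for the "fail" condition of the goto function.
    """
    goto = [{}]        # goto[state][symbol] -> next state
    output = [set()]   # keywords recognised on entering each state

    # Goto function: one state per distinct keyword prefix.
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(word)

    # Failure function: breadth-first, so shallower states are resolved first.
    failure = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the start state
    while queue:
        state = queue.popleft()
        for symbol, nxt in goto[state].items():
            queue.append(nxt)
            fallback = failure[state]
            # Follow failure links until a state with a goto on this symbol.
            while fallback != 0 and symbol not in goto[fallback]:
                fallback = failure[fallback]
            # Longest proper suffix that is also a state, else the start state.
            failure[nxt] = goto[fallback].get(symbol, 0)
            output[nxt] |= output[failure[nxt]]
    return goto, failure, output


def search(text, keywords):
    """Return (start_index, keyword) pairs found in a single pass over text."""
    goto, failure, output = build_machine(keywords)
    state = 0
    matches = []
    for index, symbol in enumerate(text):
        # On the fail condition, fall back via the failure function.
        while state != 0 and symbol not in goto[state]:
            state = failure[state]
        state = goto[state].get(symbol, 0)
        for word in output[state]:
            matches.append((index - len(word) + 1, word))
    return matches
```

For the keywords “he”, “she”, “his” and “hers” and the text string “ushers”, this sketch reports “she” at index 1 and both “he” and “hers” at index 2. The occurrence of “hers” is found because the failure function maps the state for “she” to the state for its suffix “he” when the goto function fails on the symbol ‘r’.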
The Aho-Corasick algorithm provides for an approach to single-pass matching of multiple strings by providing the failure function for mapping states to appropriate new states in the event that the goto function returns fail. However, the Aho-Corasick approach is limited to determinate search patterns due to the dependence, by the failure function, on pattern suffixes to identify new states in the event of failure of the goto function. That is to say search patterns including non-determinate features, such as non-literal symbols including wildcard symbols, cannot be mapped to a new state on failure of the goto function due to the indeterminate nature of a wildcard symbol. Such wildcard symbols can potentially correspond to any symbol in a symbol alphabet (or subsets thereof), whereas the failure function of the Aho-Corasick algorithm is only effective for a determined proper suffix of symbols in a search pattern.
For example, search patterns embodied as expressions often include wildcard symbols, such as the ‘.’ metacharacter. Such expressions are found in many and varied applications including regular expressions, data validation, data extraction and search functions. Any existing approach to applying the Aho-Corasick algorithm to expressions including wildcards involves pre-processing and post-processing steps. During pre-processing, all sub-patterns of an expression that do not include wildcards are identified. An Aho-Corasick automaton is generated for each of the identified sub-patterns for use to identify the sub-patterns in an input symbol sequence. Subsequently, post-processing is required to determine if occurrences of the sub-patterns in the input sequence correspond to occurrences at particular offsets in accordance with the original expression. The requirement to undertake such pre- and post-processing for expressions imposes an undesirable resource and time constraint for the application of the Aho-Corasick approach.
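The pre-processing and post-processing steps described above can be sketched in Python as follows, assuming a single expression in which ‘.’ is the only metacharacter and matches exactly one symbol. The function names are hypothetical. For brevity the sketch locates each sub-pattern with the built-in `str.find` method in place of a per-sub-pattern Aho-Corasick automaton, so it illustrates only the additional offset bookkeeping and not the automaton itself.

```python
def literal_pieces(expression, wildcard="."):
    """Pre-processing: return (offset, literal) for each wildcard-free run."""
    pieces, start = [], None
    # A trailing wildcard acts as a sentinel that flushes the last run.
    for i, symbol in enumerate(expression + wildcard):
        if symbol == wildcard:
            if start is not None:
                pieces.append((start, expression[start:i]))
                start = None
        elif start is None:
            start = i
    return pieces


def occurrences(text, literal):
    """Stand-in for the sub-pattern automaton: all start positions of literal."""
    found = set()
    pos = text.find(literal)
    while pos != -1:
        found.add(pos)
        pos = text.find(literal, pos + 1)
    return found


def match_with_wildcards(text, expression):
    """Return start positions at which the whole expression matches text."""
    pieces = literal_pieces(expression)
    hits = {literal: occurrences(text, literal) for _, literal in pieces}
    results = []
    # Post-processing: every sub-pattern must occur at its required offset.
    for start in range(len(text) - len(expression) + 1):
        if all(start + offset in hits[literal] for offset, literal in pieces):
            results.append(start)
    return results
```

For the expression ‘ab.d’, the pre-processing step yields the sub-patterns ‘ab’ at offset 0 and ‘d’ at offset 3. A candidate position is accepted only if both sub-patterns occur at those offsets relative to it, and this check is the post-processing overhead referred to above.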
Another non-determinate feature that can be employed in search patterns is an iteration feature, such as the ‘*’ metacharacter (indicating ‘zero or more’) and the ‘+’ metacharacter (indicating ‘one or more’). For example, the symbol pattern ‘ab*’ corresponds to a symbol sequence including an ‘a’ symbol followed by any number of (zero or more) ‘b’ symbols. Notably, the number of ‘b’ symbols is potentially infinite. Due to the variable number of symbols matched by a pattern matching automaton which can change for, and within, an input symbol pattern, it is not known how to apply the Aho-Corasick approach of failure state mapping to symbol patterns including iterative metacharacters since symbol suffixes cannot be known at the time of generating the automaton.
Thus it is desirable to provide the benefits of the Aho-Corasick algorithm for pattern matching of expressions including wildcards without the aforementioned disadvantages.