The exemplary embodiment relates to systems and methods for identifying matching subsequences in a sequence of symbols. It finds application in representing a textual document using identified matching subsequences, in searching and interpreting documents, such as classifying the textual document, and in comparing or clustering documents.
Pattern matching is widely used for searching for subsequences occurring in a sequence of symbols, such as a sequence of words in a text document. Given a sequence of symbols S, a naïve algorithm for pattern matching simply tries to match a pattern p at each position i of S. If, for example, the pattern is p=abc and the sequence is S=abdabc, the first alignment fails because S has a d in its third position, compared to a c for p. The pattern is then shifted by one, and bda (the substring starting at position 2 of S) is compared to p. However, at this stage, it is already known that S does not contain an a (the first letter of the pattern) in its second or third position, so aligning p to either of the next two positions is certain to fail.
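By way of illustration only, the naïve approach described above may be sketched as follows. The function name naive_find_all and the use of 1-based positions (chosen to agree with the numbering in the text) are assumptions made for this sketch.

```python
def naive_find_all(pattern, sequence):
    """Return the 1-based start position of every occurrence of pattern."""
    m, n = len(pattern), len(sequence)
    matches = []
    # Try every alignment in turn, ignoring anything learned from
    # earlier mismatches.
    for i in range(n - m + 1):
        j = 0
        while j < m and sequence[i + j] == pattern[j]:
            j += 1
        if j == m:  # every symbol of the pattern matched
            matches.append(i + 1)
    return matches


print(naive_find_all("abc", "abdabc"))  # [4]
```

For p=abc and S=abdabc, the sketch still tests the alignments at positions 2 and 3, although the first comparison has already shown that neither can succeed.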
There are several more advanced exact string matching algorithms for performing pattern matching. Among them are the Knuth-Morris-Pratt (KMP) algorithm (see, Knuth, et al., “Fast pattern matching in strings,” SIAM Journal on Computing, 6(2) pp. 323-350 (1977)) and the Boyer-Moore algorithm (see, Boyer, et al., “A fast string searching algorithm,” Comm. ACM, 20(10), pp. 762-772 (1977)). Both improve over a naïve algorithm by taking advantage of the information provided by a mismatch, shifting over parts of S in which it is certain that a match cannot occur. In the KMP algorithm, a failure table is created by preprocessing the pattern. This table gives, for each position j of p, the position at which to look for a new match if the current one fails. So, when ababc is matched against abababc, the first alignment (at position 1 of S) will fail at position 5 of p (c≠a), but the failure table will indicate that the next alignment should start at position 3 of S, because there is an ab starting at that position which coincides with the start of p.
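By way of illustration only, one common formulation of the KMP failure table and search may be sketched as follows. Here the table stores, for each position of p, the length of the longest proper prefix of p that is also a suffix of the portion matched so far. This is an assumption of the sketch and may differ in detail from the formulation in the cited reference. The names build_failure_table and kmp_find_all are likewise hypothetical.

```python
def build_failure_table(pattern):
    """fail[j] is the length of the longest proper prefix of
    pattern[:j + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_find_all(pattern, sequence):
    """Return the 1-based start position of every occurrence of pattern."""
    if not pattern:
        return []
    fail = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern symbols currently matched
    for i, symbol in enumerate(sequence):
        # On a mismatch, fall back to the longest prefix that can still match.
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 2)  # convert the end index to a 1-based start
            k = fail[k - 1]
    return matches


print(build_failure_table("ababc"))      # [0, 0, 1, 2, 0]
print(kmp_find_all("ababc", "abababc"))  # [3]
```

In the example from the text, the mismatch at position 5 of p leaves two symbols (ab) still matched, which corresponds to restarting the alignment at position 3 of S without re-reading any symbol of S.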
As can be seen from the example, this algorithm takes advantage of repetitions inside p (in particular, repetitions of any prefix of p). In the case of the Boyer-Moore algorithm, this is made more explicit. The original version (“BM1”) has two different tables, and a second version (“BM2”) focuses precisely on handling repetitions of suffixes (because the Boyer-Moore algorithm compares from right to left). However, such repetitive patterns are unusual in practice, as observed by Horspool (see, Horspool, “Practical fast searching in strings,” Software—Practice and Experience, vol. 10, pp. 501-506 (1980)). Horspool shows how a simplified version of the Boyer-Moore algorithm (“BMH,” dropping the second rule, and further simplifying the first one) can sometimes provide results that are better than BM1 and BM2, because the algorithm shortens the pre-processing time. A comparison of different exact simple string pattern matching methods is provided in Frakes, et al. (“Information Retrieval: Data Structures and Algorithms,” Prentice-Hall, Inc. (1992)), which plots computation time vs. pattern length. The plot shows that the KMP algorithm is fairly independent of pattern length, while other methods, such as the BM1 and BM2 versions and the BMH algorithm, perform worse at short pattern lengths but have increasingly lower computation times than the KMP algorithm as pattern length increases.
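By way of illustration only, a simplified Boyer-Moore search in the style commonly attributed to Horspool may be sketched as follows. The sketch assumes a single shift table keyed on the symbol of S aligned with the last position of p, and compares from right to left. The name horspool_find_all is hypothetical, and the sketch is not asserted to reproduce the cited implementation exactly.

```python
def horspool_find_all(pattern, sequence):
    """Return the 1-based start position of every occurrence of pattern."""
    m, n = len(pattern), len(sequence)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each symbol (excluding the
    # last pattern position) to the end of the pattern. Later occurrences
    # overwrite earlier ones, so the smallest safe shift is kept.
    shift = {symbol: m - 1 - j for j, symbol in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and sequence[i + j] == pattern[j]:  # right to left
            j -= 1
        if j < 0:
            matches.append(i + 1)
        # Shift according to the symbol under the last pattern position;
        # symbols absent from the table allow a shift of the full length m.
        i += shift.get(sequence[i + m - 1], m)
    return matches


print(horspool_find_all("abc", "abdabc"))     # [4]
print(horspool_find_all("ababc", "abababc"))  # [3]
```

The only preprocessing is the construction of one small table, which is consistent with the shortened pre-processing time noted above.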
The above methods are useful in identifying sequences of characters, such as words, in a document. Pattern matching may also be used to identify other types of annotation. For example, a document may be annotated with parts-of-speech (POS), such as noun, verb, etc., or with more fine-grained POS, such as singular noun (NN), plural noun (NNS), verb in past tense (VBD), etc. A search for a predefined sequence of POS in an annotated document may then be performed. For example, the POS sequence of a document could be searched for the pattern NN VB NNS. In some cases, matching a pattern that includes a mixture of annotation types could provide more useful information. For example, a user may be interested in retrieving all occurrences of Carlson invented NN. Or, a user may be interested in retrieving records from a collection of trivia in which each word is annotated with its POS and with whether or not it is a superlative (SUP). As an example, the objective could be to retrieve occurrences of a pattern such as NNS is the SUP NNS of France, which could yield the specific instances: Mont Blanc, highest, mountain; Bettencourt, wealthiest, person; Saint-Cirq-Lapopie, prettiest, village; and so forth.
Each of the different types of annotation (including the words themselves or their root forms) can be considered as a view of the sequence. For a sequence of symbols S, there thus may be k different views over the sequence. A query which involves more than one of the views, such as those in the examples above, involves what is referred to herein as multi-view pattern matching, because the pattern can contain any combination of the possible views. Current pattern matching methods, such as the KMP, BM1, BM2, and BMH methods, do not address the issue of multiple views. The multi-view problem also differs from regular expression matching in that the variety lies in the sequence, not in the pattern. Some studies of inexact (or fuzzy) pattern matching have been made, including cases where p includes wildcards (fixed or extensible), multi-character symbols, and regular expressions. However, these approaches are not applicable to the case of exact pattern matching.
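By way of illustration only, the multi-view matching problem itself may be sketched with a naïve baseline that tries every alignment. This is not the method of the exemplary embodiment and is not efficient. The representation of S as one dictionary of views per position, the view names word and pos, the function name multiview_find_all, and the sample annotations are all assumptions made for this sketch.

```python
def multiview_find_all(pattern, sequence):
    """Naive multi-view matching.

    sequence: list of dicts, one per position, mapping view name -> symbol.
    pattern:  list of (view name, symbol) pairs, one per position.
    Returns the 1-based start position of every occurrence.
    """
    m, n = len(pattern), len(sequence)
    matches = []
    for i in range(n - m + 1):
        # Each pattern element is compared against the view it names.
        if all(sequence[i + j].get(view) == symbol
               for j, (view, symbol) in enumerate(pattern)):
            matches.append(i + 1)
    return matches


# Hypothetical annotated sequence with k=2 views (word and POS).
S = [
    {"word": "In", "pos": "IN"},
    {"word": "1938", "pos": "CD"},
    {"word": "Carlson", "pos": "NNP"},
    {"word": "invented", "pos": "VBD"},
    {"word": "xerography", "pos": "NN"},
]

# The mixed-view pattern "Carlson invented NN" from the text.
p = [("word", "Carlson"), ("word", "invented"), ("pos", "NN")]

print(multiview_find_all(p, S))  # [3]
```

Tagging each pattern element with its view avoids ambiguity when the same symbol could appear in more than one view, but like the naïve single-view algorithm the sketch makes no use of information gained from a mismatch.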
There remains a need for an efficient method for finding occurrences of a multi-view pattern in a multi-view sequence.