The exemplary embodiment relates to systems and methods for identifying repeat subsequences in a sequence of symbols, where the repeat subsequences satisfy a threshold context diversity, and finds application in representing a textual document using identified repeat subsequences for interpretation of documents, such as classifying the textual document and comparing or clustering documents.
Inferring constituents, such as a set of repeated words or sequences of words, is a basic step for many applications involving textual documents. These are the semantic blocks that define the meaning of a document. They can be used to represent the document, and an accurate description of a document is beneficial to tasks such as classification, clustering, topic detection, and knowledge extraction. They are also useful in inferring the structure of a document. In grammatical inference, where it is assumed that the document samples are generated by a grammar, it is useful to determine which sequences of the document correspond to the same grammatical constituent before detecting how different rules are related to each other.
The standard approach for extracting features and creating representations for textual documents is called the “bag-of-words,” where each dimension in a vector space model represents one word. To consider longer sequences, higher-level language models, such as n-grams, may be used. However, such methods do not consider the context in which the sequence appears. Context, as used herein, refers to the constituents immediately to the left and right of a given constituent. In the case of a sequence of words, for example, the left context includes the word (or a sequence of words) that is positioned immediately to the left of an occurrence of the sequence, and the right context includes the word (or a sequence of words) that is positioned immediately to the right of the occurrence of the sequence.
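The notion of left and right context described above can be sketched as follows. This is a minimal illustration only, not an implementation from the disclosure; the function name and the sample sentence are hypothetical, and the contexts here are limited to a single word on each side.

```python
def contexts(tokens, seq):
    """Collect the (left, right) context pairs for each occurrence of seq.

    The left context is the token immediately to the left of an occurrence
    and the right context the token immediately to the right (None when the
    occurrence abuts a boundary of the sequence of symbols).
    """
    n, k = len(tokens), len(seq)
    pairs = []
    for i in range(n - k + 1):
        if tokens[i:i + k] == seq:
            left = tokens[i - 1] if i > 0 else None
            right = tokens[i + k] if i + k < n else None
            pairs.append((left, right))
    return pairs

tokens = "the cat sat on the mat near the cat".split()
print(contexts(tokens, ["the", "cat"]))
# → [(None, 'sat'), ('near', None)]
```

A bag-of-words or n-gram model would record only that “the cat” occurs twice; the context pairs additionally capture where each occurrence appears.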
Algorithms have been developed which have some notion of context. As an example, Solan, et al., describes a system referred to as ADIOS which uses the fraction of different contexts in which a substring appears as a feature to decide on a set of constituents. (See, Zach Solan, et al., “Unsupervised learning of natural languages,” Proc. Nat'l Academy of Sciences, vol. 102, no. 33, pp. 11629-11634 (2005).) Another approach is Zellig Harris's substitutability theory, which is related to the idea of context of a constituent. An implementation of this theory is described in Menno van Zaanen, “ABL: Alignment-based learning,” Intern'l Conf. on Computational Linguistics (COLING), pp. 961-967 (2000). Another approach uses a mutual information criterion (see, Alexander Clark, “Learning deterministic context free grammars: The Omphalos competition,” Machine Learning, pp. 93-110 (2007); and Clark, et al., “A polynomial algorithm for the inference of context free languages,” 9th International Colloquium on Grammatical Inference: Algorithms and Applications (ICGI), pp. 29-42 (2008)). Such methods, however, rely on computationally expensive algorithms to detect constituents.
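A context-diversity feature of the kind referred to above, i.e., the fraction of distinct contexts among all occurrences of a substring, can be sketched as follows. This is a simplified, hypothetical proxy for illustration, not the ADIOS algorithm itself; it again restricts contexts to one word on each side.

```python
def context_diversity(tokens, seq):
    """Fraction of distinct (left, right) single-word context pairs among
    all occurrences of seq in tokens. Returns 0.0 if seq does not occur.

    A value near 1.0 indicates the subsequence appears in many different
    contexts; a value near 0.0 indicates it is tied to a few fixed contexts.
    """
    n, k = len(tokens), len(seq)
    pairs = []
    for i in range(n - k + 1):
        if tokens[i:i + k] == seq:
            left = tokens[i - 1] if i > 0 else None
            right = tokens[i + k] if i + k < n else None
            pairs.append((left, right))
    if not pairs:
        return 0.0
    return len(set(pairs)) / len(pairs)

# "a b" occurs three times, each time in a different context.
print(context_diversity("a b c a b d a b c".split(), ["a", "b"]))
# → 1.0
```

Such a score could then be compared against a threshold context diversity to decide whether a repeat subsequence is retained as a constituent; note, however, that the naive scan above is quadratic in the worst case, illustrating why computationally efficient detection is of interest.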
There remains a need for a system and method for detection of representative constituents of text documents which allows context of repeat subsequences to be considered in a computationally efficient manner.