The exemplary embodiment relates to systems and methods for incrementally updating a data structure, such as a tree, to compute repeat subsequences in a sequence of symbols. It finds particular application in representing a textual document using identified repeat subsequences for the interpretation of documents, such as classifying a textual document, or comparing or clustering documents.
Inferring constituents, such as a set of repeated words or sequences of words, is a basic step for many applications involving textual documents. These are the semantic blocks that define the meaning of a document. They can be used to represent the document, and an accurate description of a document is beneficial to tasks such as classification, clustering, topic detection, and knowledge extraction. They are also useful in inferring the structure of a document. In grammatical inference, where it is assumed that the document samples are generated by a grammar, it is also useful to determine which sequences of the document correspond to the same grammatical constituent before detecting how different rules are related to each other.
The standard approach for extracting features and creating representations for textual documents is the “bag-of-words” model, in which each dimension in a vector space model represents one word. To consider longer sequences, higher level language models, such as n-grams, may be used.
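By way of illustration only, the two representations mentioned above can be sketched as follows. The function name `ngram_counts`, the whitespace tokenization, and the sample text are assumptions made for this example and are not taken from the exemplary embodiment.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count word n-grams in a text (hypothetical helper).

    With n=1 this yields a bag-of-words: one dimension per distinct word.
    With n>1 each dimension is a sequence of n consecutive words.
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    tokens = text.lower().split()
    # zip over n shifted copies of the token list to form consecutive n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

doc = "the cat sat on the mat the cat slept"
unigrams = ngram_counts(doc, 1)  # bag-of-words
bigrams = ngram_counts(doc, 2)   # 2-gram model

print(unigrams["the"])      # 3
print(bigrams["the cat"])   # 2
```

In both cases the length of the counted sequences is fixed in advance, whereas a repeat may be of any length.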
Document processing is often performed in batch mode, but limiting processing to batch mode reduces the application space. For many applications (such as online news analysis and real-time classification), it is common to have a set of already processed documents d1, . . . , dk to which a new document dk+1 is added. Despite sophisticated algorithms and implementations, recomputing the set of repeats in batch mode on the successive sets {d1}, {d1,d2}, . . . is not readily scalable, since each recomputation processes the entire collection again. Additionally, a repeat sequence which is not observed in the original set of documents {d1, . . . , dk} may appear later, when a further document is added to the collection.
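The batch-mode limitation can be illustrated with a minimal sketch, under stated assumptions. Here a repeat is taken to be any word sequence occurring at least twice in the collection. The function name `batch_repeats`, the length bound `max_len`, and the toy documents are hypothetical. This is a naive recomputation shown for contrast, not the incremental method of the exemplary embodiment.

```python
from collections import Counter

def batch_repeats(docs, max_len=3):
    """Naive batch computation of repeats (hypothetical helper).

    Counts every word sequence of length 1..max_len over the whole
    collection and keeps those seen at least twice. The bound max_len
    is an assumption added to keep the example small.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= 2}

stream = ["a b c", "b c d", "c d e"]  # d1, d2, d3 arriving one at a time
seen = []
history = []
for doc in stream:
    seen.append(doc)
    # Batch mode: the entire collection seen so far is rescanned each time.
    history.append(batch_repeats(seen))

print(history[0])                    # {}
print(("c", "d") in history[1])      # False
print(history[2][("c", "d")])        # 2
```

In this sketch the sequence ("c", "d") is not a repeat in {d1, d2} and becomes one only once d3 is added, while every arrival triggers a rescan of all earlier documents.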
There remains a need for a system and method for detecting repeats in a document collection in which the set of documents is not static, for example, where documents are added to a pool of documents one at a time or in small batches. The method described permits updating the count of existing repeats, and computing new ones, in such a streaming framework.