The exemplary embodiment relates to systems and methods for representing a textual document using identified repeats, which facilitates interpretation of documents such as classifying the textual document, and comparing or clustering of documents.
In order to perform mathematical operations on a document, the first step is to define a mapping that represents the document as a data structure. For example, sequences are standard representations for documents in the bioinformatics field. However, the high dimensionality and lack of flexibility inherent in sequence representations make them unpopular for natural language applications. Accordingly, the most popular representation for documents in natural language applications is the vector space model. In the vector space model, a document d is mapped into a vector v(d) ∈ ℝ^D. Normally, such a mapping proceeds by extracting a defined set of features from d, which are subsequently weighted through a scoring scheme.
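As a minimal sketch of such a mapping, the following assumes a fixed vocabulary as the feature set and raw term frequency as the scoring scheme; both are illustrative choices, and other features and weightings (e.g., tf-idf) are equally possible:

```python
import re
from collections import Counter

def vector_space_representation(document, vocabulary):
    # Tokenize into lowercase word tokens (a deliberately simplistic tokenizer).
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    # Score each feature by its raw term frequency in the document.
    return [counts[term] for term in vocabulary]

vocabulary = ["new", "york", "city", "visit"]
v = vector_space_representation("Visit New York, visit the city", vocabulary)
# v is the document's vector in the D-dimensional space, here D = 4:
# [1, 1, 1, 2]
```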
The standard approach for extracting features and creating representations for textual documents is called the “bag-of-words,” where each dimension in the vector space model represents one word. However, this standard approach of simply counting words, or unigrams, has some well-known shortcomings. For instance, it is a lossy representation and as such can map different documents onto the same representation. While less problematic in query systems, this becomes an issue when comparing full-length documents. Another drawback of counting single words is that multi-word expressions (collocations) are missed. Thus, a document in which “New York” occurs may not be recognized as different from one which contains separate occurrences of the words “New” and “York.”
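Both shortcomings can be seen directly: under a bag-of-words representation, documents that differ only in word order, including those where a collocation such as “New York” is broken apart, collapse to the same feature counts. The toy sentences below are illustrative only:

```python
from collections import Counter

def bag_of_words(text):
    # Unordered word counts: all positional information is discarded.
    return Counter(text.lower().split())

# Two documents with opposite meanings collapse to the same representation:
d1 = "the dog bit the man"
d2 = "the man bit the dog"
assert bag_of_words(d1) == bag_of_words(d2)

# The collocation "New York" is indistinguishable from separate,
# unrelated occurrences of "New" and "York":
d3 = "new york is large"
d4 = "york has a new mayor is it large"
assert bag_of_words(d3)["new"] == bag_of_words(d4)["new"]
assert bag_of_words(d3)["york"] == bag_of_words(d4)["york"]
```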
One way of addressing the issues with the bag-of-words and unigrams is to use a higher-level language model such as n-grams. However, the n-gram approach introduces other issues, e.g., the size of the vocabulary increases exponentially with n, which decreases computational efficiency. Additionally, the vector representation becomes much sparser, which makes it difficult to compute similarities. In general, n-gram models with n ≥ 3 become so problematic that performance decreases considerably.
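The exponential growth and the resulting sparsity can be sketched as follows; the vocabulary size of 50,000 is an assumed, illustrative figure:

```python
def ngrams(tokens, n):
    # Extract all contiguous n-grams from a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(ngrams(tokens, 2)[:2])  # e.g. [('the', 'quick'), ('quick', 'brown')]

# Upper bound on the n-gram vocabulary for an assumed unigram
# vocabulary of 50,000 words: it grows as |V|**n, while a given
# document still contains only len(tokens) - n + 1 of these n-grams,
# so the vector representation becomes increasingly sparse.
V = 50_000
for n in (1, 2, 3):
    print(n, V ** n)
```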
There remains a need for a more principled way of using n-grams in document representations, without the limitation of a fixed size n and the consequent shortcoming of missing shorter or longer terms.