1. Field of the Invention
The present disclosure relates to computers and more specifically to methods for improving a computer's ability to locate similar documents within a large set of documents.
2. Description of the Related Art
Finding content similarity over large sets of documents is a well-studied area. Two distinct approaches to finding similar documents within a large document set are 1) using semantic information to compare seed and target documents; and 2) using a term frequency method to represent documents as mathematical vectors so that seed document vectors and target document vectors can be easily compared.
Use of semantic information typically involves using linguistic methods to analyze document sets and perform comparisons based on the linguistic analysis. Given reasonable training sets and time, these approaches can be effective in determining the similarity of documents. The main drawback with these approaches tends to be performance. These methods are typically computationally expensive, thus limiting the number of documents that can be effectively analyzed. Another challenge is the need for retraining if the domain of the documents shifts or changes.
The second general approach to finding documents with similar content is to use a term frequency method to represent a document as a mathematical vector that can easily be compared to other vectors. There is a wide variety of methods to perform this transformation from documents to vectors, and a wide variety of methods to compare the similarity of the resulting vectors. These term frequency methods are typically faster than linguistic methods, but can easily mislabel documents.
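By way of illustration only, the term frequency approach can be sketched as follows. The sketch assumes raw term counts as the vector representation and cosine similarity as the comparison, which are merely one of the many possible combinations; the function names `term_frequency_vector` and `cosine_similarity` and the sample sentences are hypothetical and are not drawn from any particular prior art system.

```python
import math
import re
from collections import Counter


def term_frequency_vector(text):
    """Map a document to a sparse vector of raw term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical seed and target documents.
seed = term_frequency_vector("The bank raised interest rates")
target_a = term_frequency_vector("Interest rates were raised by the bank")
target_b = term_frequency_vector("We walked along the river bank")

score_a = cosine_similarity(seed, target_a)
score_b = cosine_similarity(seed, target_b)
```

In this sketch the second target still receives a nonzero score solely because it shares the terms "the" and "bank" with the seed, even though it concerns an unrelated subject, which suggests how a purely term-based comparison can mislabel documents.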
For very large document sets, even term frequency based methods have performance challenges. The widely used term frequency-inverse document frequency (TF/IDF) method requires a count of the documents that contain a given term in order to calculate a weight for that term. The inverse document frequency, which is derived from the number of documents containing a given term, must therefore be calculated for every unique term in the set before any document vector can be created. This forces serialized processing over the document set, because no vector can be completed until the entire set has been scanned.
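The serialization constraint can be seen in a minimal two-pass sketch. It assumes the common weighting of term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; other TF/IDF variants exist. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a document into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(documents):
    """Pass 1: for each term, count the documents that contain it."""
    df = Counter()
    for text in documents:
        df.update(set(tokenize(text)))  # set() counts each term once per document
    return df


def tf_idf_vector(text, df, n_docs):
    """Pass 2: weight each term count by log(N / df) (assumed variant)."""
    counts = Counter(tokenize(text))
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}


# Hypothetical document set.
docs = [
    "the bank raised interest rates",
    "the river bank flooded",
    "the committee raised concerns",
]

# Pass 1 must finish over the whole set before pass 2 can start.
df = document_frequencies(docs)
vectors = [tf_idf_vector(d, df, len(docs)) for d in docs]
```

Because `tf_idf_vector` needs the completed `df` table, no single document can be vectorized independently of the rest of the set, and adding documents to the set changes the weights of vectors already computed.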
The term frequency-inverse corpus frequency (TF/ICF) method addresses the performance challenges of TF/IDF, and results show that it can be more accurate than TF/IDF on homogeneous data sets. However, like TF/IDF and any other term frequency based method, it retains the weakness of mislabeling similar documents.
Further improvements can advance the state of the art.