Detecting whether a document is similar to another document in a document collection is becoming an important problem due to the tremendous growth of the Internet and data portals (see, e.g., [1]). Document collections are increasing in both the number of document collections and the number of documents in each collection because of the ease of transmitting and receiving documents via the Internet and data portals. As the size of a document collection increases, the probability of similar documents being re-submitted or re-indexed in the document collection increases as well. Maintaining similar documents in a document collection not only drains valuable resources for the computation and storage of indices for the document collection but also affects the collection statistics and, hence, potentially the accuracy of searching the document collection.
Storing similar documents in a document collection affects both the accuracy and efficiency of an information search and retrieval engine used with the document collection. Retrieving similar documents in response to a user's query potentially lowers the number of valid responses provided to the user, which thereby lowers the accuracy of the user's response set. Further, processing similar documents necessitates additional computation without introducing any additional benefit to the user, which lowers the processing efficiency of the user's query.
Additionally, similar documents skew collection statistics of the document collection. Collection statistics are typically used as part of a similarity computation of a query for the document collection. With similar documents in the document collection, the collection statistics of the document collection are biased and may affect the overall precision of the document collection and its information search and retrieval engine.
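As a small illustration of this skew, the following Python sketch computes one common collection statistic, the inverse document frequency (IDF), before and after duplicates are added. The choice of IDF as the statistic, the formula log(N / df), and the toy documents are assumptions made for illustration; the text above does not name a specific statistic.

```python
import math

def idf(term, docs):
    """Inverse document frequency, log(N / df).

    Assumes the term occurs in at least one document of docs.
    """
    df = sum(1 for text in docs if term in text.lower().split())
    return math.log(len(docs) / df)

# Hypothetical four-document collection.
docs = [
    "near duplicate detection",
    "web search engine",
    "medical data portal",
    "email traffic",
]

# The same collection after one document is re-submitted four times.
with_dups = docs + ["near duplicate detection"] * 4

print(idf("duplicate", docs), idf("duplicate", with_dups))
print(idf("web", docs), idf("web", with_dups))
```

In this sketch the duplicated term appears less discriminative than it was, while terms in the untouched documents appear rarer than they were, even though no new information entered the collection.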
The need to detect similar documents arises in various types of document collections. As an example, for a document collection of documents received via the Internet, similar documents are undesirable additions to the document collection, and similar document detection would be useful prior to adding another document to the document collection. As another example, for a document collection of classified documents, similar documents need to be identified for either declassification or classification, and similar document detection would be useful for document declassification or document classification. As another example, for a document collection of electronic mail (e-mail) documents, similar e-mail documents need to be identified for processing the document collection, and similar document detection would be useful for e-mail traffic processing, which may be continuous. As a potential difficulty with processing a document collection of e-mail documents, many e-mail documents have a short length and, hence, may prove difficult to detect as similar documents.
As another example of the need to detect similar documents, consider searching web documents (e.g., documents available over the Internet via the World Wide Web format), which typically have a short length (e.g., typically around 4 kilobytes (KB) (see, e.g., [1])). With web documents, one might believe that matching the uniform resource locator (URL) would identify similar documents. However, because many web sites use dynamic presentation, where the content changes depending on the region or other variables, relying on the URL is of little value. Further, data providers often create several names for one web site to attract users with different interests or perspectives. For example, the web sites www.fox4.com, onsale.channel9.com, and www.realtv.com all point to an advertisement for realTV.
As another example of the need for detection of similar documents, similar documents can populate a document collection when multiple document sources are used. For instance, the National Center for Complementary and Alternative Medicine (NCCAM) (see, e.g., [2]) supports an information search and retrieval engine for a document collection of medical data having inputs from multiple sources of medical data. Given the nature of the medical data, similar documents in the document collection can be common. Because unique document identifiers are most likely not possible when the document identifiers originate from different sources, the detection of similar documents is essential to produce non-redundant results for the information search and retrieval engine.
Conventional techniques for detecting similar documents can be divided into three categories: shingling techniques; similarity measure techniques; and image processing techniques. As the first category, shingling techniques were developed by various researchers, for example: DSC [3]; COPS [4]; SCAM [5], [6], and [7], which is a successor to COPS; and KOALA [8]. The shingling technique uses a set of contiguous terms, or shingles, for a document and compares the number of matching shingles. The shingles can be considered to be subdocuments for the document. With the comparison of subdocuments between two documents, a percentage of overlap is calculated between the two documents. For the shingling technique, a hash value is determined for each subdocument, and the hash values are filtered to reduce the number of comparisons performed, which improves the runtime performance of the shingling technique. With the shingling technique, a document is not compared to other documents; instead, subdocuments are compared. Because subdocuments, rather than documents, are compared, each comparison may produce many potential similar documents. Returning many potential matches requires a large amount of user involvement to sort the potential similar documents, which dilutes the usefulness of the shingling technique.
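The basic shingling idea described above can be sketched in Python as follows. This is a minimal illustration, not the DSC, COPS, SCAM, or KOALA implementation: the shingle size of three words, whitespace tokenization, the MD5 hash, and the overlap measure (shared shingles divided by the union of shingles) are all assumptions, and the function names are hypothetical.

```python
import hashlib

def shingles(text, size=3):
    """Return the set of hashed contiguous word shingles of a document."""
    words = text.lower().split()
    if len(words) < size:
        # Assumption: a document shorter than one shingle is one shingle.
        grams = [" ".join(words)] if words else []
    else:
        grams = [" ".join(words[i:i + size])
                 for i in range(len(words) - size + 1)]
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def overlap_percent(doc_a, doc_b, size=3):
    """Percentage of shingles shared by two documents (assumed measure)."""
    a, b = shingles(doc_a, size), shingles(doc_b, size)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
print(overlap_percent(doc_a, doc_b))  # 6 shared of 8 distinct shingles: 75.0
```

Changing a single word alters every shingle that contains it, which is why the shingle size has a large effect on the reported overlap.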
To overcome the basic efficiency issues with the shingling technique, several optimization techniques for the shingling technique were proposed to reduce the number of comparisons made. For example, removing frequently occurring shingles (see, e.g., [8]) and retaining only every twenty-fifth shingle (see, e.g., [3]) were proposed. With these optimization techniques, the computation time of the shingling technique is reduced. However, because no semantic premise is used to reduce the volume of data, a degree of randomness is introduced to the comparison process, which results in relatively non-similar documents being identified as potential similar documents.
In terms of computational time complexity, the shingling technique has order O(kd log(kd)), where k is the number of shingles per document, and d is the number of documents in the document collection. Even with the performance-improving technique of removing shingles occurring in over 1,000 documents and keeping only every twenty-fifth shingle, the implementation of the DSC took 10 days to process 30 million documents [3].
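The two filters mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation reported in [3]: "every twenty-fifth shingle" is interpreted here as keeping shingles whose integer hash is divisible by 25, and the function name, parameter names, and input layout are hypothetical.

```python
from collections import Counter

def filter_shingles(doc_shingles, keep_every=25, max_doc_freq=1000):
    """Reduce the number of shingles kept per document.

    doc_shingles maps a document id to a set of integer shingle hashes.
    A shingle is kept only if its hash is divisible by keep_every
    (assumed sampling rule) and it occurs in at most max_doc_freq documents.
    """
    # Number of documents in which each shingle hash occurs.
    doc_freq = Counter(h for hashes in doc_shingles.values() for h in hashes)
    return {
        doc_id: {h for h in hashes
                 if h % keep_every == 0 and doc_freq[h] <= max_doc_freq}
        for doc_id, hashes in doc_shingles.items()
    }

docs = {"a": {25, 50, 7}, "b": {25, 3}, "c": {25, 75}}
# With a frequency cutoff of 2, shingle 25 (in all three documents) is dropped.
print(filter_shingles(docs, keep_every=25, max_doc_freq=2))
```

Neither filter looks at what a shingle means, so which shingles survive is effectively arbitrary with respect to document content.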
As an alternative to the DSC shingling technique, the DSC-SS shingling technique was proposed [3]. The DSC-SS shingling technique uses super shingles, in which several shingles are combined in a super shingle, which results in a document having a few super shingles, instead of many shingles. With the DSC shingling technique, the similarity between documents was measured as a ratio of matching shingles in two documents, and with the DSC-SS shingling technique, the similarity between two documents is measured using one super shingle for each document. Using a single super shingle is more efficient than using multiple shingles because a full counting of all overlaps between shingles is no longer required.
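One way to picture super shingles is the sketch below, which sorts the shingle hashes of a document, groups them, and hashes each group. The grouping rule, the group size of four, the SHA-1 hash, and the "share at least one super shingle" test are assumptions for illustration, and the function names are hypothetical; the details of DSC-SS in [3] may differ.

```python
import hashlib

def super_shingles(shingle_hashes, group_size=4):
    """Combine sorted shingle hashes into super shingles, one per group."""
    ordered = sorted(shingle_hashes)
    supers = set()
    for i in range(0, len(ordered), group_size):
        chunk = ",".join(str(h) for h in ordered[i:i + group_size])
        supers.add(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return supers

def share_super_shingle(hashes_a, hashes_b, group_size=4):
    """Treat two documents as candidates if any super shingle matches."""
    return bool(super_shingles(hashes_a, group_size)
                & super_shingles(hashes_b, group_size))

a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 9, 10, 11, 12}   # first group of four is identical to a's
c = {1, 2, 3, 5, 6, 7, 8, 9}      # one hash differs, so every group shifts
print(share_super_shingle(a, b))  # True
print(share_super_shingle(a, c))  # False
```

A single set intersection replaces the shingle-by-shingle overlap count. In this sketch a document with few shingles yields only one or two super shingles, so a small change can remove every match.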
In terms of computational time complexity, the run time of the DSC-SS shingling technique is of the order O(kd log(kd)), although k is significantly smaller for the DSC-SS shingling technique than for the DSC shingling technique. Further, the computation required to count overlap is eliminated with the DSC-SS shingling technique, which reduces the overall runtime. Nonetheless, the DSC-SS shingling technique reportedly does not work well for documents having a short length. Moreover, the shingling technique and its optimization attempts are very sensitive to adjustments in the size of shingles and the frequency of retained shingles.
As the second category, similarity measure techniques were proposed in, for example, [9] and [10]. The similarity measure techniques are similar to prior work done in document clustering (see, e.g., [11]). A similarity measure technique uses similarity computations to group potentially similar documents and compares each document pair-wise. Because of the pair-wise comparison, a similarity measure technique is computationally prohibitive because the computational time complexity is of the order O(d^2), where d is the number of documents.
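The quadratic cost of pair-wise comparison can be seen in the following sketch, which compares every pair of documents and counts the comparisons. The cosine measure over raw term counts, the threshold of 0.9, and the function names are assumptions chosen for illustration; [9] and [10] may use different similarity measures.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similar_pairs(docs, threshold=0.9):
    """Compare every pair of documents; return the matches and the count."""
    vectors = {doc_id: Counter(text.lower().split())
               for doc_id, text in docs.items()}
    pairs, comparisons = [], 0
    for id_a, id_b in combinations(sorted(vectors), 2):
        comparisons += 1  # d * (d - 1) / 2 comparisons in total
        if cosine(vectors[id_a], vectors[id_b]) >= threshold:
            pairs.append((id_a, id_b))
    return pairs, comparisons

docs = {
    "d1": "alpha beta gamma",
    "d2": "alpha beta gamma",
    "d3": "delta epsilon zeta",
    "d4": "eta theta iota",
}
print(similar_pairs(docs))  # ([('d1', 'd2')], 6)
```

Four documents need 6 comparisons; 30 million documents would need roughly 4.5 x 10^14.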
To make the similarity measure technique computationally feasible, document terms are identified for each document, and a document collection is searched using the document terms. With the enhanced similarity measure technique, document terms are initially identified for a document to be compared to the document collection. Each term for the document is used to search the document collection, and a final weight is produced for each document in the document collection having a matching term. The document in the document collection having the largest weight is determined to be the most similar document. Using the document as a query in this way effectively clusters the documents. Even the enhanced similarity measure technique becomes computationally infeasible for a large or dynamic document collection because each document must be queried against the entire collection.
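The document-as-query procedure described above can be sketched with an inverted index, as follows. The weighting used here, log(1 + N / df) summed over matching terms, is an assumption chosen only so that rarer terms count for more; the text does not specify how the final weight is computed, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def build_index(collection):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def most_similar(query_text, collection, index):
    """Use each term of the query document to search the collection."""
    scores = defaultdict(float)
    n_docs = len(collection)
    for term in set(query_text.lower().split()):
        postings = index.get(term, ())
        if not postings:
            continue
        # Assumed weighting: terms found in fewer documents weigh more.
        weight = math.log(1 + n_docs / len(postings))
        for doc_id in postings:
            scores[doc_id] += weight
    if not scores:
        return None  # no document shares a term with the query
    return max(scores, key=scores.get)

collection = {
    "d1": "shingles hash overlap",
    "d2": "medical data sources",
    "d3": "email traffic processing",
}
index = build_index(collection)
print(most_similar("medical data from several sources", collection, index))
```

Only documents that share at least one term with the query are scored, which avoids the full pair-wise comparison, but every incoming document still triggers one query against the whole index.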
As the third category, image processing techniques were proposed in, for example, [13] and [14]. An image processing technique processes documents as images to determine similar documents. The image processing technique maps the similar document detection problem into an image-processing domain, rather than into the text-processing domain as with the shingling technique, the similarity measure technique, and the parsing filtering technique.
There exists a need for a technique to detect whether a document is similar to another document in a document collection, where the technique is scalable to and computationally feasible for any size of document and any size of document collection.