1. Field of the Invention
This invention relates generally to data processing and, in particular, to methods and related systems for indexing the contents of documents for comparison with the contents of other documents to determine similarity.
2. Description of the Related Art
Word processing programs and operating systems have traditionally been able to compare the contents of files and report differences or similarities in content between them. A variety of file comparison programs are currently available, each of which may be adequate in certain respects but has drawbacks that make it poorly suited for certain applications.
The proliferation of Internet usage, and the ease with which information can be posted to, searched for on, and retrieved from the Internet, has made the Internet a primary source of information. This proliferation has resulted in increased unauthorized posting of copyrighted material on the Internet. In addition, much posted information is not removed in a timely manner, resulting in duplicate or near-duplicate material on the Internet. As information is updated, previously posted versions may remain, resulting in large quantities of outdated information. While searching for material on the Internet, it may be desirable to identify and skip over such outdated content, or to identify it so that it can be deleted. The proliferation of Internet usage has thus increased the need for methods and systems for comparing documents and identifying matching content.
Several methods of comparing files can be categorized as information retrieval methods, which compare statistical profiles of documents. For example, one method computes a histogram of word frequencies for each document, or a histogram of the frequency of certain pairs or juxtapositions of words in a document. Documents with similar histograms are considered to be similar documents. Refinements of these methods include preprocessing of documents (e.g., removal of common or unimportant words) prior to statistical profile computation, and applying the same information retrieval method to subsections of documents. A primary limitation of information retrieval methods is that they tend to produce false positive matches that are difficult to prevent, since dissimilar documents may often have similar statistical profiles.
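The false positive limitation described above can be illustrated with a minimal sketch. The function names and the use of cosine similarity as the measure of histogram closeness are assumptions made for illustration; the description above does not specify a particular measure.

```python
from collections import Counter
import math


def word_histogram(text):
    """Count how often each whitespace-separated word occurs (case-folded)."""
    return Counter(text.lower().split())


def cosine_similarity(h1, h2):
    """Cosine of the angle between two word-frequency histograms (assumed measure)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Two sentences with different meanings but identical word counts.
a = "the dog bit the man"
b = "the man bit the dog"
print(word_histogram(a) == word_histogram(b))
print(cosine_similarity(word_histogram(a), word_histogram(b)))
```

Because word order is discarded, the two sentences produce the same histogram and would be reported as matching even though their content differs.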
Another method of comparing documents is known as document "fingerprinting", which involves computing hashes of selected substrings of documents. A particular set of substring hashes chosen to represent a document is the document's fingerprint. Documents are compared by comparing the substring hashes making up the fingerprints of the documents. The more substring hashes chosen, the more accurate the document's fingerprint for comparison to another document. However, if too many hashes are chosen, the data processing system may be unable to handle large quantities of documents. The similarity of two documents is defined as a ratio C/T, where C is the number of hashes the two documents have in common and T is the total number of hashes saved from one of the documents. Assuming a well-behaved hash function, this ratio is a good estimate of the actual percentage overlap between the two documents. However, this also assumes that a sufficient number of substring hashes are saved.
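The ratio C/T can be sketched as follows. This is a minimal illustration only: the substring length k = 8, the use of SHA-1 truncated to 64 bits, and the function names are assumptions, and for simplicity the sketch saves the hash of every fixed-length substring rather than a selected subset.

```python
import hashlib


def substring_hash(s):
    """Hash a substring to a 64-bit integer (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text, k=8):
    """Set of hashes of every length-k substring (no selection step here)."""
    return {substring_hash(text[i:i + k]) for i in range(len(text) - k + 1)}


def similarity(fp_a, fp_b):
    """C/T: hashes in common divided by the hashes saved from document A."""
    if not fp_a:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "a quick brown fox jumps over a sleeping cat"
print(similarity(fingerprint(doc_a), fingerprint(doc_b)))
```

Note that the ratio is not symmetric: T is taken from one of the two documents, so similarity(A, B) and similarity(B, A) may differ when the fingerprints have different sizes.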
In the past, various approaches have been used to determine which substrings in a document are selected for hashing and which of these hashes are saved as part of the document fingerprint. One approach is to compute hashes of all substrings of a fixed length k and retain those hashes that are evenly divisible by some integer p (i.e., hashes that are 0 mod p). A second approach is to partition the document into substrings with hashes that are 0 mod p and save those hashes. In this second approach, the substrings selected are not of a fixed length. Rather, a character is added to the substring until the hash of the substring is 0 mod p, at which point the hash is saved and the next substring is started.
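The two prior approaches described above may be sketched as follows. The hash function (SHA-1 truncated to 64 bits) and all function names are assumptions chosen for illustration, and each saved hash is paired with its starting position only to make the output easier to inspect.

```python
import hashlib


def substring_hash(s):
    """Hash a substring to a 64-bit integer (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def select_fixed_length(text, k, p):
    """First approach: hash every length-k substring, keep hashes that are 0 mod p."""
    saved = []
    for start in range(len(text) - k + 1):
        value = substring_hash(text[start:start + k])
        if value % p == 0:
            saved.append((start, value))
    return saved


def select_by_partition(text, p):
    """Second approach: grow a substring one character at a time until its
    hash is 0 mod p, save that hash, then start the next substring."""
    saved = []
    start = 0
    for end in range(1, len(text) + 1):
        value = substring_hash(text[start:end])
        if value % p == 0:
            saved.append((start, value))
            start = end
    # Any trailing characters whose hash never reached 0 mod p are not saved.
    return saved


text = "the quick brown fox jumps over the lazy dog"
print(len(select_fixed_length(text, k=5, p=4)))
print(len(select_by_partition(text, p=4)))
```

In both functions the decision to save depends only on the hash value, so the number and spacing of the saved hashes cannot be controlled directly.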
However, because these methods depend on the hash values of the document substrings in determining which hash values are saved, there may be large gaps in a document where no hash value will be saved and there may be portions where an excess of hash values are saved. If gaps between stored hash values are too long, a document's fingerprint may be too faint for accurate comparison with other documents. In addition, there may potentially be a situation where an entire document is bypassed without having a single substring hash value saved for a fingerprint, and where another document has more hashes than necessary saved for a fingerprint.
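By way of illustration only, the risk of a document being bypassed entirely can be estimated under an idealized assumption that is not stated in the foregoing: that the hash values of the n substrings of a document are independent and uniformly distributed, so that each is 0 mod p with probability 1/p. Overlapping substrings are not truly independent, so the following is a rough estimate rather than an exact result.

```latex
\[
\Pr[\text{no hash value saved}]
  = \left(1 - \frac{1}{p}\right)^{n}
  \approx e^{-n/p}
\]
```

For example, for a short document with n = p and p large, the expected number of saved hash values is n/p = 1, yet the estimated probability that none is saved is about 1/e, or roughly 0.37.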
Current methods of selecting substring hash values have been unable to strike a balance between saving enough hash values to adequately index the contents of a document and avoiding saving so many hash values that system capacity is limited.
Once a sufficient number of substring hash values have been saved to adequately index the contents of a document, the hash values are sorted by value to generate an index that can be quickly queried to identify matching content. For data sets having no special properties, standard algorithms used to sort a data set of N hash values representing the contents of documents require an amount of time proportional to N log N. The log N factor results from the need to recursively sort and merge smaller and smaller problem sizes, with each instance being about one-half the size of the previous one. N hash values can be subdivided in half at most log N times. While the log N factor is inconsequential for small data sets, the log N levels of recursive sorting may contribute over one order of magnitude to the cost of sorting for large sets of hash values. This cost of sorting may become prohibitive as the sets of hash values to be sorted become large. As a result, for large data sets of hash values, there has been a need for methods and related systems for faster sorting in order to generate the required indices to be used to identify matching content.
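The origin of the log N factor can be seen in a minimal merge sort sketch that also reports how many levels of halving were needed. Merge sort is used here only as a representative of the standard algorithms referred to above; the function name and the returned depth value are illustrative assumptions.

```python
def merge_sort(values, depth=0):
    """Return (sorted_list, deepest_recursion_level_reached)."""
    if len(values) <= 1:
        return list(values), depth
    mid = len(values) // 2
    # Each recursive call works on about one-half of the previous problem.
    left, depth_left = merge_sort(values[:mid], depth + 1)
    right, depth_right = merge_sort(values[mid:], depth + 1)
    # Merging the two sorted halves takes time proportional to their length.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, max(depth_left, depth_right)


hashes = list(range(1024, 0, -1))
sorted_hashes, levels = merge_sort(hashes)
print(levels)  # 1024 values can be halved 10 times
```

Each level touches all N values once during merging, so the total work grows as N multiplied by the number of levels. For a set of about one billion hash values (2^30), that is roughly 30 levels.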
These generated indices of sorted hash values are saved to disk, and current methods of querying the indices require a disk input/output to access the contents of the indices. Because the time required to perform a disk input/output for each hash value queried against the indices greatly limits the speed at which queries of the indices can be performed, there has been a need for methods and related systems for faster querying of disk-based indices of hash values.
The present invention encompasses data processing methods and related systems for indexing the contents of documents for comparison with the contents of other documents to identify matching content.
A method for comparing the contents of a query document to the content on the World Wide Web is set forth. The contents of a query document are indexed and compared to content from the World Wide Web which is continuously retrieved and indexed. The method for indexing the contents of a document may comprise selecting substrings from the document, hashing the substrings to generate a plurality of hash values having a known range of values, selecting certain hash values to save from the generated hash values, and sorting the saved hash values. Methods for selecting certain hash values to save are set forth.
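The indexing steps recited above (selecting substrings, hashing them, selecting hash values to save, and sorting) may be sketched as follows. This is an illustration under stated assumptions and not the selection method of the invention, which is not detailed in this section: the `keep` rule defaults to the prior-art 0 mod p test as a placeholder, the hash is SHA-1 truncated to 64 bits so that values fall in the known range 0 to 2^64 - 1, and all function names are hypothetical. The index is held in memory and queried by binary search.

```python
import bisect
import hashlib


def substring_hash(s):
    """Hash a substring to an integer in the known range [0, 2**64)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_index(text, k=8, keep=lambda v: v % 4 == 0):
    """Hash every length-k substring, keep the selected hashes, and sort them.
    The default `keep` rule is a placeholder, not the claimed selection method."""
    hashes = (substring_hash(text[i:i + k]) for i in range(len(text) - k + 1))
    return sorted({v for v in hashes if keep(v)})


def contains(index, value):
    """Binary search a sorted index for a single hash value."""
    pos = bisect.bisect_left(index, value)
    return pos < len(index) and index[pos] == value


def count_matches(index, query_text, k=8, keep=lambda v: v % 4 == 0):
    """Count saved hashes of the query document that appear in the index."""
    query_hashes = build_index(query_text, k, keep)
    return sum(1 for v in query_hashes if contains(index, v))


page = "methods and related systems for indexing the contents of documents"
query = "systems for indexing the contents of files"
index = build_index(page)
print(count_matches(index, query))
```

The same `keep` rule must be applied to the indexed page and to the query document; otherwise matching substrings may be saved on one side and discarded on the other.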
Another aspect of the invention sets forth a system for detecting partially or wholly duplicated documents on the World Wide Web. The system comprises a plurality of servers, with each server containing the indexed contents of a plurality of Universal Resource Locator pages, and a user interface for querying the indexed contents of the Universal Resource Locator pages.
Yet another aspect of the invention sets forth another method for comparing the contents of a query document to the content on the World Wide Web. The contents of a plurality of Universal Resource Locator pages from the World Wide Web are indexed and stored on a plurality of servers. The contents of a query document are indexed and compared to the index of contents of the Universal Resource Locator pages from the World Wide Web.
The present invention is explained in more detail below with reference to the drawings.