In many computer systems, it is useful to determine the resemblance between objects such as data records. The data records can represent text, audio, or video signals. For example, Internet search engines maintain indices of millions of data records in the form of multimedia documents called Web pages. In order to make their Web pages more "visible," some users may generate thousands of copies of the same document, hoping that all of the submitted documents will be indexed.
In addition, duplicate copies of documents may be brought into different Web sites to facilitate access; this is known as "mirroring." That is, identical or nearly identical documents are located at different Web addresses. Other sources of "almost" duplicate documents arise when documents undergo revision, documents are contained in other documents, or documents are broken into smaller documents.
A search engine, such as the AltaVista™ search engine, can greatly reduce the amount of disk space used for storing its index when only a single copy of a document is indexed. The locations of other copies or nearly identical versions of the document can then be associated with the stored copy. Therefore, it is useful to determine to what extent two documents resemble each other. If a new document to be indexed highly resembles a previously indexed document, then the content of the new document does not need to be indexed, and only its location needs to be linked to the previously indexed document.
Classically, the notion of similarity between arbitrary bit strings has been expressed as a distance, for example, the Hamming distance or the edit distance. Although these distance metrics are reasonable for pair-wise comparisons, they are totally inadequate at the scale of the Web where the distance between billions of pairs of documents would need to be measured.
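For illustration, the two classical distances named above can be sketched as follows. This is a minimal sketch, not part of the described method; the function names and the `pair_count` helper are hypothetical and are included only to show why pair-wise comparison does not scale.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def pair_count(n: int) -> int:
    """Number of unordered document pairs among n documents."""
    return n * (n - 1) // 2
```

Each edit-distance computation is quadratic in the document length, and the number of pairs grows quadratically with the collection: `pair_count(100_000)` is already 4,999,950,000.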
In U.S. Pat. No. 5,909,677 filed by Broder et al. on Jun. 18, 1996, a method for determining the resemblance of documents is described. The method measures to what extent two documents are "roughly" the same. The AltaVista™ search engine uses this method to discard approximately 10K pages out of the 20K daily submissions. As an advantage, the method does not require complete copies of the documents to be retained for comparison, which would waste storage as well as processing time. Instead, the method stores a small "sketch" that characterizes each document.
The method works by processing the document to abstract the content of the document into a sketch. For example, the content of complex documents expressed as many thousands of bytes can be reduced to a sketch of just hundreds of bytes. The sketch is constructed so that the resemblance of two documents can be approximated from the sketches of the documents with no need to refer to the original documents. Sketches can be computed fairly fast, i.e., in time linear with respect to the size of the documents. Furthermore, given two sketches, the resemblance of the corresponding documents can be computed in time linear with respect to the size of the sketches.
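One well-known way to build such a sketch is min-wise hashing over word "shingles" (overlapping runs of consecutive words). The following is a minimal sketch under that assumption, not the exact procedure of the cited patent; the function names, the shingle width `w`, the number of hash functions, and the use of BLAKE2b are all illustrative choices.

```python
import hashlib


def shingles(text: str, w: int = 4) -> set:
    """Set of overlapping w-word sequences; a short text yields one shingle."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {" ".join(words)}
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text: str, num_hashes: int = 64, w: int = 4) -> list:
    """Fixed-size sketch: the minimum hash value under each hash function."""
    sh = shingles(text, w)
    if not sh:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimate_resemblance(a: list, b: list) -> float:
    """Fraction of sketch positions that agree, in one pass over the sketches."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

With 64 hash functions of 8 bytes each, a sketch occupies 512 bytes regardless of document length. For a fixed number of hash functions, building the sketch takes time proportional to the number of shingles, and comparing two sketches takes time proportional to the sketch length.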
Documents are said to resemble each other when they have the same content, except for minor differences such as formatting, corrections, capitalization, web-master signature, logos, etc. The resemblance can be expressed as a number between 0 and 1, defined precisely below, such that when the resemblance of two documents is close to one it is likely that the documents are roughly the same, and when the resemblance is close to zero, they are significantly dissimilar.
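As a hedged illustration of a measure with these properties, the following assumes that resemblance is the ratio of shared to total word shingles (the Jaccard coefficient) after normalizing case and punctuation. The precise definition is given elsewhere in this document; the names `normalize`, `shingle_set`, and `resemblance`, and the shingle width `w`, are hypothetical.

```python
import re


def normalize(text: str) -> list:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingle_set(text: str, w: int = 3) -> set:
    """Set of overlapping w-token tuples from the normalized text."""
    tokens = normalize(text)
    if not tokens:
        return set()
    return {tuple(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Shared shingles divided by all distinct shingles, a value in [0, 1]."""
    sa, sb = shingle_set(a, w), shingle_set(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Under these assumptions, two texts that differ only in capitalization, spacing, or punctuation score 1.0, and texts with no shingle in common score 0.0.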
When applying this method to process the entire Web, which is roughly estimated to have hundreds of millions of documents, the cost of computing and scoring the sketches is still prohibitive. In addition, since the data structures that need to be stored and manipulated number in the hundreds of millions, efficient memory operations are extremely difficult, particularly when they have to be performed in a reasonable amount of time.
Therefore, it is desired to provide a method that can determine when the resemblance of documents is above a certain threshold using less storage and less processing time.