This invention relates generally to the field of computerized processing of data objects, and more particularly to extracting features from data objects so that like data objects can be identified.
In many computer systems, it is useful to determine the resemblance between objects such as data records. The data records can represent text, audio, or video signals. For example, Internet search engines maintain indices of millions of data records in the form of multimedia documents called Web pages. In order to make their Web pages more “visible,” some users may generate thousands of copies of the same document, hoping that all documents submitted will be indexed.
In addition, duplicate copies of documents may be brought into different Web sites to facilitate access; this is known as “mirroring.” That is, identical or nearly identical documents are located at different Web addresses. Other sources of “almost” duplicate documents arise when documents undergo revision, documents are contained in other documents, or documents are broken into smaller documents.
A search engine, such as the AltaVista™ search engine, can greatly reduce the amount of disk space used for storing its index when only a single copy of a document is indexed. The locations of other copies or nearly identical versions of the document can then be associated with the stored copy. Therefore, it is useful to determine to what extent two documents resemble each other. If a new document to be indexed highly resembles a previously indexed document, then the content of the new document does not need to be indexed, and only its location needs to be linked to the previously indexed document.
Classically, the notion of similarity between arbitrary bit strings has been expressed as a distance, for example, the Hamming distance or the edit distance. Although these distance metrics are reasonable for pair-wise comparisons, they are totally inadequate at the scale of the Web where the distance between billions of pairs of documents would need to be measured.
In U.S. Pat. No. 5,909,677 filed by Broder et al. on Jun. 18, 1996, a method for determining the resemblance of documents is described. The method measures to what extent two documents are “roughly” the same. The AltaVista™ search engine uses this method to discard approximately 10K pages out of the 20K daily submissions. As an advantage, the method does not require a complete copy of the content of the documents to be compared; keeping complete copies would waste storage as well as processing time. Instead, the method stores a small “sketch” that characterizes the document.
The method works by processing the document to abstract the content of the document into a sketch. For example, the content of complex documents expressed as many thousands of bytes can be reduced to a sketch of just hundreds of bytes. The sketch is constructed so that the resemblance of two documents can be approximated from the sketches of the documents with no need to refer to the original documents. Sketches can be computed fairly fast, i.e., linear with respect to the size of the documents, and furthermore, given two sketches, the resemblance of the corresponding documents can be computed in linear time with respect to the size of the sketches.
Documents are said to resemble each other when they have the same content, except for minor differences such as formatting, corrections, capitalization, web-master signature, logos, etc. The resemblance can be expressed as a number between 0 and 1, defined precisely below, such that when the resemblance of two documents is close to one it is likely that the documents are roughly the same, and when the resemblance is close to zero, they are significantly dissimilar.
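As a minimal illustration of a resemblance value between 0 and 1, the following sketch computes the ratio of shared elements to total distinct elements of two sets (the Jaccard coefficient). That this ratio matches the precise definition given later in the source is an assumption here, and the function name `resemblance` and the treatment of two empty inputs are illustrative choices.

```python
def resemblance(a, b):
    """Return |a & b| / |a | b| for two collections of hashable elements."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty objects are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two of the four distinct elements are shared.
    print(resemblance({1, 2, 3}, {2, 3, 4}))
```

A value near one indicates that the two element sets are almost the same, and a value near zero indicates that they share almost nothing.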
When applying this method to process the entire Web, which is roughly estimated to have hundreds of millions of documents, the cost of computing and scoring the sketches is still prohibitive. In addition, since the data structures that need to be stored and manipulated count in the hundreds of millions, efficient memory operations are extremely difficult, particularly when they have to be performed in a reasonable amount of time.
Therefore, it is desired to provide a method that can determine when the resemblance of documents is above a certain threshold using less storage, and less processing time.
Provided is a computer-implemented method for determining the resemblance of data objects such as Web pages indexed by a search engine connected to the World Wide Web. Each data object is partitioned into a sequence of tokens. The tokens can be characters, words, or lines of the data objects. Overlapping sets of a fixed number of tokens of each object are grouped into shingles.
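The tokenizing and shingling step can be sketched as follows. This is a minimal illustration, assuming whitespace-separated words as tokens and a shingle width of four; the function name `shingles` and the parameter `w` are hypothetical and not taken from the source.

```python
def shingles(text, w=4):
    """Return the overlapping runs of w consecutive tokens, in document order."""
    if w < 1:
        raise ValueError("shingle width must be at least 1")
    tokens = text.split()  # assumption: tokens are whitespace-separated words
    # An object with fewer than w tokens yields no shingles in this sketch.
    return [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]


if __name__ == "__main__":
    for s in shingles("a rose is a rose is a rose"):
        print(s)
```

The eight tokens of the example text give five overlapping shingles, of which three are distinct, so the set associated with that text has three elements.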
Each shingle is assigned a unique identification and viewed as an element of a set associated with the data object. The unique identifications can be determined by, for example, digital fingerprinting techniques. A plurality of pseudo random permutations are applied to the set of all possible unique identifications to generate permuted images of the sets. For each data object, a minimum or smallest element from the image of its associated set under each of these permutations is selected. The elements so selected constitute a sketch of the data object. The sketches characterize the resemblance of the data objects. A typical sketch comprises five hundred to a thousand bytes.
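The selection of minimum elements can be sketched as follows. This is an illustration under stated assumptions, not the method as claimed: the fingerprint is a truncated BLAKE2b hash rather than any particular fingerprinting technique, and each pseudo random permutation is stood in for by a map x -> (a*x + b) mod P with a nonzero, which permutes the integers below the prime P. All names (`fingerprint`, `make_permutations`, `sketch`, `estimate_resemblance`) are hypothetical.

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime; identifiers are kept below it


def fingerprint(shingle):
    """Map a shingle (tuple of tokens) to a 56-bit integer identifier."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=7).digest(), "big")


def make_permutations(count, seed=0):
    """Return count pairs (a, b), each defining x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]


def sketch(shingle_set, permutations):
    """Keep the smallest permuted identifier under each permutation."""
    ids = {fingerprint(s) for s in shingle_set}
    if not ids:
        raise ValueError("cannot sketch an object with no shingles")
    return [min((a * x + b) % P for x in ids) for a, b in permutations]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions at which two equal-length sketches agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same permutations")
    return sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)
```

With one hundred permutations and eight bytes per stored minimum, a sketch in this illustration occupies 800 bytes, which falls in the range mentioned above. Two sketches are comparable only when they were built with the same list of permutations.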
In addition, the selected elements that form the sketch can be partitioned into a plurality of groups, for example, six groups for each sketch. The groups are fingerprinted again to thus generate a plurality of, for example, six features that further characterize the resemblance of the data object. The vector of features associated with a data object would typically comprise thirty to a hundred bytes.
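The grouping of sketch elements into features can be sketched as follows. This is a minimal illustration assuming the sketch is a list of non-negative integers below 2**64 whose length is a multiple of the number of groups, and using a truncated BLAKE2b hash as a stand-in for the second fingerprinting step. The name `features` is hypothetical.

```python
import hashlib


def features(sketch, groups=6):
    """Split a sketch into equal consecutive groups and fingerprint each one."""
    if groups < 1 or len(sketch) % groups != 0:
        raise ValueError("sketch length must be a multiple of the group count")
    size = len(sketch) // groups
    result = []
    for g in range(groups):
        chunk = sketch[g * size:(g + 1) * size]
        # Serialize each element as 8 big-endian bytes before hashing.
        data = b"".join(x.to_bytes(8, "big") for x in chunk)
        result.append(hashlib.blake2b(data, digest_size=8).digest())
    return result


if __name__ == "__main__":
    base = list(range(84))           # stand-in for a real sketch
    changed = [999] + base[1:]       # differs only in the first group
    fa, fb = features(base), features(changed)
    print([x == y for x, y in zip(fa, fb)])
```

Because a feature matches only when every element of its group matches, two objects share a feature only if their sketches agree on that entire group. In this illustration six 8-byte features give a 48-byte vector, within the range stated above.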
A first and a second data object are designated as fungible when the first and second data object share more than a certain threshold, for example, two of their features. Fungible data objects are collected into clusters of highly resembling data objects. For some types of data objects fungibility can be based on more than two common features.
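The fungibility test and clustering can be sketched as follows. Several points are assumptions of this illustration rather than statements from the source: features are compared position by position, two objects are treated as fungible when at least `min_shared` features match, and clusters are formed transitively with a union-find structure. The pairwise loop is quadratic and is used only to keep the example short; it is not meant to reflect how matching pairs would be found at Web scale.

```python
from itertools import combinations


def shared_features(fa, fb):
    """Count the positions at which two feature vectors agree."""
    return sum(1 for x, y in zip(fa, fb) if x == y)


def cluster(feature_vectors, min_shared=2):
    """Group object names whose feature vectors share enough features."""
    parent = {name: name for name in feature_vectors}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(feature_vectors, 2):
        if shared_features(feature_vectors[a], feature_vectors[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in feature_vectors:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


if __name__ == "__main__":
    docs = {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 2, 9, 9, 9, 9],   # shares two features with "a"
        "c": [7, 7, 7, 7, 7, 7],   # shares none
    }
    print(cluster(docs))
```

Raising `min_shared` corresponds to basing fungibility on more than two common features for object types that need a stricter test.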
In one aspect of the invention, frequently occurring shingles can be eliminated. If the data objects are Web pages, example frequent shingles are HTML comment tags that identify the program that generated the Web page, shared headers or footers, common text such as sequences of numbers, etc.
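Eliminating frequent shingles can be sketched as follows. This is a minimal illustration that assumes "frequently occurring" means appearing in more than `max_docs` distinct data objects; the source does not state a cutoff, so the threshold and the name `drop_frequent_shingles` are hypothetical.

```python
from collections import Counter


def drop_frequent_shingles(doc_shingles, max_docs):
    """Remove shingles that occur in more than max_docs different objects."""
    doc_frequency = Counter()
    for shingle_set in doc_shingles.values():
        # Count each shingle once per object, however often it repeats there.
        doc_frequency.update(set(shingle_set))
    frequent = {s for s, n in doc_frequency.items() if n > max_docs}
    return {name: set(s) - frequent for name, s in doc_shingles.items()}


if __name__ == "__main__":
    pages = {
        "p1": {("generated", "by"), ("spring", "sale")},
        "p2": {("generated", "by"), ("user", "manual")},
        "p3": {("generated", "by"), ("release", "notes")},
    }
    print(drop_frequent_shingles(pages, max_docs=2))
```

In the example, the shingle shared by all three pages is dropped, so boilerplate common to many pages no longer makes otherwise unrelated pages appear to resemble each other.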
In another aspect of the invention, the parsing, grouping, representing, selecting, partitioning, and fingerprinting are performed with first parameters to determine the tokens, shingles, fingerprints, groups, and features. These steps can then be repeated with second parameters to perform variable threshold filtering of the data objects.