There are many scenarios in which it is desirable to identify documents in a document corpus that are duplicates of other documents in the corpus. As one specific example, an electronic commerce (“e-commerce”) merchant might maintain a document corpus containing a large number of documents that store data describing products available from the merchant. In this scenario, it may be desirable to identify duplicate documents in the document corpus so that customers are not confused by being presented with different records for the same product. Duplicate documents that are identified in the document corpus may then be merged in order to eliminate the duplication.
As another example, when an e-commerce merchant receives from a vendor a document identifying a new product for inclusion in the document corpus, it may be desirable to determine whether a document that corresponds to the new product (i.e., a duplicate of the document submitted by the vendor) already exists in the document corpus. If a duplicate document already exists in the corpus, a new document is not created in the corpus for the product. If, however, no duplicate document exists in the corpus, the document for the new product may be added to the corpus. In other scenarios, it might also be desirable to identify duplicate documents across two or more document corpora.
Traditional mechanisms for identifying duplicate documents, such as those that attempt to identify duplicate documents based upon the frequency of terms contained therein, do not perform well for document corpora having millions or even hundreds of millions of documents. The disclosure made herein is presented with respect to these and other considerations.