There are many scenarios in which it is desirable to identify documents in a document corpus that are variations or duplicates of another document. As one specific example, an electronic commerce (“e-commerce”) merchant might maintain a document corpus containing a large number of documents that store data describing products available from the merchant (e.g., product records). In this scenario, it may be desirable to identify duplicate documents in the document corpus in order to avoid confusing customers by presenting different records for the same product. Documents in the corpus that are identified as duplicates or variations of one another may then be merged in order to eliminate the duplication or variation.
As another example, when an e-commerce merchant receives a document identifying a new product from a vendor for inclusion in the document corpus, it may be desirable to determine whether the document corpus already contains a document that corresponds to the new product (i.e., a duplicate of the document submitted by the vendor). If a duplicate document already exists in the corpus, a new document will not be created in the corpus for the product. If, however, no duplicate document exists in the corpus, the document for the new product may be added to the corpus.
Existing mechanisms for detecting duplicate documents and document variants frequently require significant human involvement in the detection process. This limitation may cause existing mechanisms for duplicate detection to be slower and more expensive than desirable. The disclosure made herein is presented with respect to these and other considerations.