One or more embodiments disclosed within this specification relate to de-duplication of data within a search space.
Many organizations maintain extensive databases to track different types of data such as customer data, inventory data, or the like. Having accurate, e.g., high-quality, data is often of significant importance. One aspect of maintaining quality data is a process referred to as de-duplication. De-duplication refers, in general, to the identification and elimination of duplicate records within a database.
De-duplication can be a complex undertaking for a variety of reasons. For example, in many cases, the sheer size of the database to undergo de-duplication means that performing the comparisons necessary to identify duplicate records can be computationally expensive or even infeasible, since comparing every record against every other record requires a number of comparisons that grows quadratically with the number of records. In addition, many duplicate records include one or more fields that do not match exactly, making it difficult to determine whether one record is a duplicate of another record.
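By way of illustration only, the two difficulties described above can be sketched in Python. The sketch is not drawn from any embodiment. The function names, the equal-weight averaging of per-field similarity, and the 0.85 threshold are assumptions chosen for the example, and the exhaustive pairwise loop is shown only to make the comparison count concrete.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs an exhaustive comparison must examine."""
    return n * (n - 1) // 2


def similarity(a: dict, b: dict) -> float:
    """Average per-field string similarity in [0, 1] over the fields of `a`.

    Equal weighting of fields is an assumption made for this sketch.
    """
    scores = [
        SequenceMatcher(None, str(a[f]).lower(), str(b.get(f, "")).lower()).ratio()
        for f in a
    ]
    return sum(scores) / len(scores)


def find_duplicates(records: list, threshold: float = 0.85) -> list:
    """Return index pairs whose similarity meets the (hypothetical) threshold.

    Every record is compared against every other record, so the work grows
    with pair_count(len(records)).
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]


# Illustrative records: the first two differ by one character in one field.
records = [
    {"name": "John Smith", "city": "Austin"},
    {"name": "Jon Smith", "city": "Austin"},
    {"name": "Mary Jones", "city": "Boston"},
]
duplicates = find_duplicates(records)
```

Under these assumptions, a database of one million records would require pair_count(1_000_000), i.e., 499,999,500,000 comparisons, which illustrates why exhaustive comparison can become unreasonable. The first two illustrative records would be missed by an exact-match test yet are flagged by the approximate comparison.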