Data is often organized as large collections of objects. When objects are added over time, data duplication often becomes a problem. For example, a collection may include multiple objects that represent the same entity. As used herein, the term “duplicate objects,” or any variation thereof, is intended to cover objects representing the same entity. Duplicate objects are not necessarily identical; they can have different facts or different values for the same facts.
Duplicate objects are undesirable for many reasons. They increase storage cost and take longer to process. They lead to inaccurate results, such as an inaccurate count of distinct objects. They also cause data inconsistency. For example, subsequent operations that affect only some of the duplicate objects leave the objects representing the same entity inconsistent with one another.
Traditional approaches to identifying duplicate objects assume homogeneity in the input set (all books, all products, all movies, etc.) and compare different facts depending on the type of the objects. For example, when identifying duplicate objects in a set of objects representing books, traditional approaches match the ISBN values of the objects; when identifying duplicate objects in a set of objects representing people, they match the SSN values of the objects. One drawback of the traditional approaches is that they are effective only for specific types of objects and tend to be ineffective when applied to a collection of objects of different types. Also, even if the objects in the collection are of the same type, these approaches are ineffective when the objects include incomplete or inaccurate information.
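By way of illustration only, the type-specific matching described above can be sketched as follows. The sketch assumes that each object is represented as a Python dictionary of facts; the function name find_duplicates_by_key, the fact names, and the placeholder values are hypothetical and are not taken from any particular system.

```python
from collections import defaultdict


def find_duplicates_by_key(objects, key_fact):
    """Group objects that share the same value of a single key fact.

    Objects that lack the key fact are skipped, because this kind of
    matching has nothing to compare them on.
    """
    groups = defaultdict(list)
    for obj in objects:
        value = obj.get(key_fact)
        if value is not None:
            groups[value].append(obj)
    # Only groups with more than one member indicate duplication.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical mixed collection with placeholder fact values.
objects = [
    {"type": "book", "title": "Example Title", "isbn": "isbn-001"},
    {"type": "book", "title": "Example Title (reprint)", "isbn": "isbn-001"},
    {"type": "book", "title": "Example Title"},  # ISBN fact is missing
    {"type": "person", "name": "A. Person", "ssn": "ssn-001"},
]

by_isbn = find_duplicates_by_key(objects, "isbn")
by_ssn = find_duplicates_by_key(objects, "ssn")
```

In this sketch, matching on the ISBN fact groups only the first two book objects. The third book object, which lacks an ISBN value, goes undetected even though it may represent the same entity, and the person object is never considered because it has no ISBN fact at all. Matching on the SSN fact has the corresponding limitation for the book objects.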
For these reasons, what is needed is a method and system that identifies duplicate objects in a large number of objects having different types and/or incomplete information.