Data is often organized as large collections of objects. When objects are added to a collection over time, data duplication frequently results. For example, a collection may include multiple objects that represent the same entity. As used herein, the term “duplicate objects” refers to objects representing the same entity. The names used to describe the represented entity are not necessarily the same among the duplicate objects.
Duplicate objects are undesirable for many reasons. They increase storage cost and processing time. They lead to inaccurate results, such as an inaccurate count of distinct objects. They also cause data inconsistency.
Conventional approaches to identifying duplicate objects assume homogeneity in the input set of objects (all books, all products, all movies, etc.). Identifying duplicates among objects of different types requires examining different fields for each type. For example, when identifying duplicate objects in a set of objects representing books, traditional approaches match the ISBN values of the objects; when identifying duplicate objects in a set of objects representing people, traditional approaches match the SSN values of the objects. One drawback of the conventional approaches is that they are effective only for specific types of objects, and tend to be ineffective when applied to a collection of objects of different types. Also, even if the objects in the collection are of the same type, these approaches tend to be ineffective when the objects include incomplete or inaccurate information.
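By way of illustration only, the type-specific key matching described above may be sketched as follows. The sketch is not drawn from any particular conventional system; the object layout, the field names (`type`, `isbn`, `ssn`), the placeholder key values, and the helper `find_duplicates_by_key` are all assumptions introduced for this example.

```python
from collections import defaultdict

# Hypothetical mapping from object type to the single field used as its key.
# The type names and field names are illustrative assumptions.
KEY_FIELD_BY_TYPE = {"book": "isbn", "person": "ssn"}


def find_duplicates_by_key(objects):
    """Group objects sharing a type-specific key value.

    Returns (duplicates, unmatched): `duplicates` is a list of groups with
    two or more objects; `unmatched` holds objects that could not be keyed,
    either because their type has no known key field or the field is missing.
    """
    groups = defaultdict(list)
    unmatched = []
    for obj in objects:
        key_field = KEY_FIELD_BY_TYPE.get(obj.get("type"))
        key_value = obj.get(key_field) if key_field else None
        if key_value is None:
            unmatched.append(obj)
        else:
            groups[(obj["type"], key_value)].append(obj)
    duplicates = [group for group in groups.values() if len(group) > 1]
    return duplicates, unmatched


# Placeholder data; the key values are not real identifiers.
objects = [
    {"type": "book", "name": "Moby-Dick", "isbn": "isbn-0001"},
    {"type": "book", "name": "Moby Dick; or, The Whale", "isbn": "isbn-0001"},
    {"type": "book", "name": "Moby Dick"},   # incomplete: no ISBN recorded
    {"type": "movie", "name": "Moby Dick"},  # type with no configured key
]

duplicates, unmatched = find_duplicates_by_key(objects)
```

In this sketch, the two book objects sharing a key value are grouped as duplicates even though their names differ, whereas the book lacking an ISBN and the movie object cannot be compared at all, which reflects the two drawbacks noted above.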
What is needed is a method and system that identifies duplicate objects in a large number of objects having different types and/or incomplete information.