Data is often organized as large collections of objects. When objects are added over time, data duplication is a common problem. For example, a collection may include multiple objects that represent the same entity. Objects are duplicate objects if they represent the same entity, even if the information about the entity contained in the objects is different. Duplicate objects increase storage cost, take longer to process, and confuse the display of information to the user. Duplicate objects can also lead to inaccurate results, such as an inaccurate count of distinct objects.
Some applications determine whether two objects are duplicate objects by comparing the value of a specific fact, such as the Social Security Number (SSN), the International Standard Book Number (ISBN), or the Universal Product Code (UPC). The specific facts used for comparison are analogous to the primary keys of database tables in relational databases. This approach is effective when every object contains at least one such fact. But objects built from incomplete information may not have any of these facts. Also, when the value of such a fact is inaccurate in either of the two objects, this approach treats the two objects as distinct objects even if other facts associated with the two objects indicate that they are duplicate objects. Thus, this approach reliably determines whether two objects are duplicate objects only when both objects include accurate and complete information.
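The key-based comparison described above can be illustrated with a minimal sketch. It assumes each object is a Python dictionary mapping fact names to values. The fact names, the helper duplicates_by_key, and the sample records are hypothetical and chosen only for illustration; they do not describe any particular application.

```python
# Hypothetical names of the "key" facts; real applications would choose their own.
KEY_FACTS = ("ssn", "isbn", "upc")


def duplicates_by_key(a: dict, b: dict, key_facts=KEY_FACTS) -> bool:
    """Return True only if both objects carry the same value for some key fact."""
    for name in key_facts:
        if name in a and name in b and a[name] == b[name]:
            return True
    return False


# Three made-up records intended to describe the same person.
complete = {"name": "Alice Smith", "birth_date": "1970-01-02", "ssn": "000-00-0001"}
missing_key = {"name": "Alice Smith", "birth_date": "1970-01-02"}  # no key fact
wrong_key = {"name": "Alice Smith", "birth_date": "1970-01-02", "ssn": "000-00-0007"}  # mistyped

print(duplicates_by_key(complete, dict(complete)))  # True: key facts agree
print(duplicates_by_key(complete, missing_key))     # False: one object lacks a key fact
print(duplicates_by_key(complete, wrong_key))       # False: one key value is inaccurate
```

In the last two calls the name and date of birth agree, yet the comparison reports the objects as distinct because it looks only at the key fact.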
Some other applications determine whether two objects are duplicate objects by comparing all common facts of the two objects. The two objects are determined to be duplicate objects when the number of matching common facts exceeds a threshold. This approach is problematic because it does not always give accurate results. For example, the chance that two objects sharing the same gender are duplicates is much lower than the chance that two objects sharing the same date of birth are duplicates. By treating all facts equally, this approach is both over-inclusive, because it identifies as duplicates distinct objects that share many facts with little indicating value, and under-inclusive, because it excludes duplicate objects that share only a few facts with great indicating value.
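The threshold approach and its two failure modes can be sketched as follows, again assuming objects are dictionaries of facts. The function names, the threshold of 2, and the sample records are assumptions made for illustration, and the two "Jones" records are simply stipulated to describe the same person.

```python
def count_matching_facts(a: dict, b: dict) -> int:
    """Count the facts present in both objects that have equal values."""
    return sum(1 for name in a.keys() & b.keys() if a[name] == b[name])


def duplicates_by_count(a: dict, b: dict, threshold: int) -> bool:
    """Treat every fact equally: duplicates if the match count exceeds the threshold."""
    return count_matching_facts(a, b) > threshold


# Two different (made-up) people who share several weakly indicative facts.
person_1 = {"gender": "female", "country": "US", "occupation": "engineer",
            "birth_date": "1970-01-02"}
person_2 = {"gender": "female", "country": "US", "occupation": "engineer",
            "birth_date": "1985-06-07"}

# Two (made-up) records assumed to describe the same person, sharing only
# a few strongly indicative facts.
record_1 = {"name": "Robert Jones", "birth_date": "1955-03-04",
            "birth_place": "Springfield"}
record_2 = {"name": "Bob Jones", "birth_date": "1955-03-04",
            "birth_place": "Springfield", "occupation": "teacher"}

print(duplicates_by_count(person_1, person_2, threshold=2))  # True: over-inclusive
print(duplicates_by_count(record_1, record_2, threshold=2))  # False: under-inclusive
```

Lowering the threshold to catch the second pair would make the first kind of error more frequent, since a plain count does not distinguish a shared gender from a shared date of birth.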
For these reasons, what is needed is a method and system that determines whether two objects built from imperfect information are duplicate objects.