1. Field of the Invention
The present invention relates to a method and system for identifying relationships among data records in a data repository and, in particular, to methods and systems for automatically determining data records that are related in a direct or embedded manner and for automatically removing duplicate data records.
2. Background Information
Very large databases are often plagued by duplicate data records, which can prove quite costly to manufacturers, vendors, and other companies that distribute literature or products or that contact individuals based upon the presence of records in a database. For example, marketers attempting to target an audience with a promotional offer may spend funds unnecessarily by sending duplicate marketing literature to multiple persons in the same business or household. A potentially larger problem is the annoyance to recipients, who, upon receiving multiple copies of the same literature, may respond adversely and ignore it, thus defeating its intended purpose. As another example, multiple records of the same content inadvertently stored under minor variations of a field used to index the records may make the data difficult to retrieve. For instance, duplicate billing charges to a customer's account, when the account information is not entered consistently, may be difficult if not impossible to track when the customer calls to question charges reflected on an invoice. Many similar problems arise from the practical inability to correctly detect and remove duplicates, especially when the data is related but not identical.
Current systems for removing duplicates from a database are limited to matching the values of one or more fields. The success of such matches depends greatly on how uniformly the data has been entered. Also, duplicates are typically removed by post-processing records in the database after incorrect information has already been entered. Once incorrect information has been entered, it is typically more difficult to detect and clean out. Moreover, much of the de-duplication process is performed manually, by comparing one data record against another; if the correct two records are not compared, a duplicate may go undetected. The larger or more diverse the database, the more complex and time-intensive the problem of detecting and removing duplicates becomes.
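By way of illustration only, the limitation of matching on field values can be sketched as follows. This is a minimal sketch and is not taken from any particular prior system: the function name `dedupe_exact`, the field names `name` and `address`, and the sample records are all hypothetical. It shows that an exact match on field values removes a verbatim duplicate but retains a record that differs only in how the same information was entered.

```python
def dedupe_exact(records, key_fields):
    """Keep the first record seen for each exact combination of key field values.

    Hypothetical helper for illustration: two records are treated as
    duplicates only if every key field matches character for character.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the third record describes the same person and
# address as the first two, but was entered with minor variations.
records = [
    {"name": "John Smith", "address": "12 Main St."},
    {"name": "John Smith", "address": "12 Main St."},
    {"name": "J. Smith", "address": "12 Main Street"},
]

result = dedupe_exact(records, ["name", "address"])

# The verbatim duplicate is dropped, but the variant entry is kept.
for record in result:
    print(record["name"], "|", record["address"])
```

In this sketch the related but non-identical third record survives, which corresponds to the case in which non-uniform data entry defeats a field-value match.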