Embodiments of the present invention relate to information management, and more particularly relate to techniques for identifying duplicate records in data imported into a data repository, such as a data hub.
A data hub, or master data management (MDM) solution, is a collection of software and/or hardware components that enables a business to maintain a single, master source of information that is accessible across multiple, heterogeneous information management systems. Currently, software vendors offer a variety of different types of data hubs directed to different business areas or industries. For example, the Product Information Management Data Hub (PIMDH) developed by Oracle Corporation provides product development/manufacturing organizations a centralized view of their product-related data.
Since a data hub acts as a centralized, authoritative source of information, an important aspect of managing a data hub is maintaining the quality of the data stored therein. Accordingly, any data that is imported into a data hub should be appropriately “cleansed” so that it is valid, consistent, and accurate. Merely by way of example, consider a product management data hub (such as PIMDH) that is configured to store records for a plurality of different products/items. In some cases, records may be imported into the data hub (from, for instance, legacy and/or third-party systems) that duplicate some portion of the data already present in the hub. This results in duplicate or overlapping records per item. To maintain the consistency of the data stored in the hub, these duplicate records should be merged into a single, master record per item.
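Merely by way of illustration, the merging of duplicate records into a single master record per item can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from PIMDH or any other product: the field names (`item_number`, `description`, `weight_kg`), the function `merge_duplicates`, and the rule that the first non-empty value for an attribute is retained are all hypothetical.

```python
def merge_duplicates(records, key="item_number"):
    """Collapse records that share the same key into one master record per item.

    Assumed survivorship rule (hypothetical): the first non-empty value seen
    for a field is kept; later records only fill in fields that are missing
    or empty in the master record.
    """
    masters = {}
    for record in records:
        # Fetch the master record for this item, creating it if needed.
        master = masters.setdefault(record[key], {})
        for field, value in record.items():
            if master.get(field) in (None, ""):
                master[field] = value
    return list(masters.values())


# Hypothetical data: one record already in the hub, two imported records.
hub_records = [
    {"item_number": "A-100", "description": "Steel bolt", "weight_kg": None},
]
imported_records = [
    {"item_number": "A-100", "description": "", "weight_kg": 0.02},
    {"item_number": "B-200", "description": "Copper washer", "weight_kg": 0.005},
]

merged = merge_duplicates(hub_records + imported_records)
```

In this sketch, the two records for item A-100 collapse into one record that keeps the existing description and takes the weight from the imported record, while item B-200 is carried over unchanged. An actual data hub would typically apply more elaborate matching and survivorship rules than exact equality on a single key.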
In current practice, the problem of duplicate records described above is generally managed in an ex post fashion. In other words, records from external systems are initially imported into the data hub, without regard to the existence of duplicate records in the hub. Once the records have been imported, the data hub is manually searched to identify potential duplicates. The potential duplicates are then exported from the data hub, manually merged, and then re-imported into the data hub as merged data.
However, this ex post approach is problematic for several reasons. For example, the process of importing records, exporting potential duplicates, and then re-importing the merged data is inefficient and potentially very time-consuming. This is particularly true if the number of records being imported (i.e., the size of the import batch) is large. Further, since uncleansed (e.g., duplicate-containing) data is initially imported into the production environment of the data hub, the users of the production environment (e.g., internal users, external partners, etc.) will see an inconsistent view of the data until the duplicates are removed/merged. This problem can be mitigated by bringing down the production environment while the imported records are searched, exported, merged, and re-imported. However, doing so increases the downtime of the data hub during the import process. If records are imported on a regular basis, this increased downtime may be unacceptable.