1. Field of the Disclosure
The present disclosure relates to a method and system for integrating data into a database and, more particularly, to a method and system for integrating data from multiple data sources into a database whilst minimising data duplication.
2. Background of the Disclosure
The volume of data stored in databases is growing exponentially, as is the rate at which the data becomes available. The data to be stored in databases is also becoming more complex, since each record often comprises a large number of different attributes.
Data from multiple data sources often needs to be integrated into a central database. With large volumes of data, integration can leave the central database plagued with errors and anomalies, such as duplicate records. Duplicate records are database records that refer to the same real-world entity. Duplicate records reduce the effectiveness of data querying and analysis tasks. The result is less efficient data analysis and a higher cost to enterprises using the data.
It has been proposed previously to use de-duplication rules to de-duplicate data. The de-duplication rules are learnt from training data which is either passively collected before the learning process or actively collected during the learning process. Conventional methods which use such de-duplication rules are, however, limited: they are not able to handle heterogeneous data representing different types of entities, because of the diverse characteristics of each entity type.
A distributed duplicate elimination method has been proposed previously which parallelises the de-duplication process using the MapReduce model. The problem with this method, however, is that it is incapable of operating on records which are sparsely populated, which have a large number of attributes, or which have heterogeneous attributes or entity types.
A further technique has been proposed previously in which duplicate detection is carried out using structured query language (SQL) queries that are processed using a database management system (DBMS). The problem with this technique is that it must rely on an index to record and retrieve similar records. Building and updating such an index is prohibitively slow for large data sets with thousands of attributes.
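By way of illustration only, SQL-based duplicate detection of the general kind referred to above may be sketched as follows. The sketch is a simplification made under stated assumptions: it uses SQLite through the Python standard library, a hypothetical `person` table with only two attributes, and exact matching on those attributes as a stand-in for a similarity test. It is not taken from any particular previously proposed system.

```python
import sqlite3


def find_duplicate_groups(rows):
    """Return (name, city, count) for every key shared by more than one row.

    `rows` is an iterable of (id, name, city) tuples. The table layout and
    the choice of (name, city) as the matching key are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # The index that the DBMS uses to group matching records. It must be
    # maintained on every insert, which is the cost noted in the text.
    conn.execute("CREATE INDEX idx_person_key ON person (name, city)")
    conn.executemany(
        "INSERT INTO person (id, name, city) VALUES (?, ?, ?)", rows
    )
    cursor = conn.execute(
        "SELECT name, city, COUNT(*) FROM person "
        "GROUP BY name, city HAVING COUNT(*) > 1 "
        "ORDER BY name, city"
    )
    groups = cursor.fetchall()
    conn.close()
    return groups
```

With only two indexed attributes the index is cheap to maintain. The difficulty described above arises when records carry thousands of attributes, since the index must then cover or be rebuilt over a far wider key.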
Techniques have been proposed previously in which records are partitioned into blocks that are either disjoint or overlapping. These techniques rely on either sorting or indexing the records, both of which are expensive operations with a high time complexity with respect to the number of records.
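By way of illustration only, a sorting-based partitioning into overlapping blocks may be sketched as follows. The blocking key, the record fields (`id`, `surname`, `postcode`) and the window size are hypothetical choices made for this sketch and are not taken from any particular previously proposed technique.

```python
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the surname plus the postcode.
    return (record["surname"][:3].lower(), record["postcode"])


def candidate_pairs(records, window=3):
    """Sort records by blocking key, then pair records in a sliding window.

    Each window of `window` consecutive records forms one block; successive
    blocks overlap. Only pairs within a block are returned as candidates.
    """
    # The sort is the expensive step referred to in the text: its cost
    # grows as O(n log n) with the number of records n.
    ordered = sorted(records, key=blocking_key)
    pairs = set()
    for start in range(len(ordered)):
        block = ordered[start:start + window]
        for first, second in combinations(block, 2):
            pairs.add((first["id"], second["id"]))
    return pairs
```

Only records that fall within the same window are compared, which reduces the number of comparisons, but the whole record set must be re-sorted or kept in sorted order whenever new records arrive.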
There is a need for a method and system for integrating data from multiple data sources into a central database whilst minimising duplication of records in the central database.
The present disclosure seeks to provide an improved method and system for integrating data into a database.