Databases and data warehouses are computer-based data structures designed to allow the storing and querying of records, which are typically received from one or more sources. The records generally correspond to entities, such as individuals, organizations and property. In certain cases, a database system is confronted with a situation wherein a new set of data is substantially duplicative of a data set previously submitted to the system. Furthermore, the new data set may include some additions, modifications or deletions, even a small number, when compared with the previously submitted data set. Processing largely redundant sets of data wastes valuable system resources and presents significant scalability issues.
For example, a previously submitted data set may contain all the residential telephone listings of a particular geographic area. Thereafter, perhaps monthly or semiannually, the system may receive a new set of data that comprises a more recent set of either all or part of the residential telephone listings of the particular geographic area. Processing the new, highly duplicative data set in its entirety will, at a minimum, fail to identify records that have been deleted since the previous submission and will require the intended recipient(s) to process substantially more data than necessary.
It is contemplated by the present invention that identifying or assigning a persistent key corresponding to each record could be used to facilitate efficient processing and identification of each record by the intended recipient(s) of the data set. For example, residential telephone listings do not contain a persistent key for each record. Therefore, any comparison in current systems is based upon the entire record or some combination of data in the record, such as last name, first name, telephone number and/or address. Occasionally, one record or many records in a data set may differ from a previously submitted data set, such as when a postal service splits a ZIP code. In such a case, a persistent key facilitates more efficient processing by enabling the intended recipient(s) to update the affected record(s) based upon the persistent key, thus minimizing the processing required to initially identify the affected record(s).
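By way of illustration only, the following minimal sketch contrasts a whole-record comparison with a key-based comparison for the ZIP code example. The key values ("K1", "K2"), the field layout and the listing data are hypothetical and are not drawn from any particular system; how a persistent key is identified or assigned is not addressed here.

```python
# Hypothetical listings keyed by an assumed persistent key.
# Fields: last name, first name, telephone number, address, ZIP code.
old = {
    "K1": ("Smith", "Ann", "555-0101", "12 Oak St", "60601"),
    "K2": ("Jones", "Bob", "555-0102", "34 Elm St", "60601"),
}
new = {
    "K1": ("Smith", "Ann", "555-0101", "12 Oak St", "60601"),
    "K2": ("Jones", "Bob", "555-0102", "34 Elm St", "60699"),  # ZIP code split
}

# Whole-record comparison (no key): the changed listing shows up as one
# removed record plus one seemingly unrelated added record.
removed_by_content = set(old.values()) - set(new.values())
added_by_content = set(new.values()) - set(old.values())

# Key-based comparison: the same listing is recognized as a single
# modification of the record identified by its persistent key.
changed_keys = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
```

In this sketch the recipient would need to update only the record identified by "K2", rather than matching a removed record against an added one by content.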
Unfortunately, current systems do not have an efficient way to compare two data sets and determine the additions, deletions or modifications between the two data sets while maintaining a persistent key. Nor do they have, without limitation, an efficient way to generate a log representing a subset of such additions, deletions or modifications, together with the respective persistent key, for further review, analysis and/or reporting.
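The kind of comparison and log described above might be sketched as follows. This is an illustrative assumption, not the method of the present invention: the function name diff_snapshots, the action labels and the in-memory dictionary representation are all hypothetical, and the sketch presumes that each record already carries a persistent key.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, ...]
# Log entry: (action, persistent key, previous record, current record).
LogEntry = Tuple[str, str, Optional[Record], Optional[Record]]


def diff_snapshots(previous: Dict[str, Record],
                   current: Dict[str, Record]) -> List[LogEntry]:
    """Compare two data sets keyed by an assumed persistent key."""
    log: List[LogEntry] = []
    # Visit every key seen in either data set, in a stable order.
    for key in sorted(previous.keys() | current.keys()):
        before = previous.get(key)
        after = current.get(key)
        if before is None:
            log.append(("ADD", key, None, after))
        elif after is None:
            log.append(("DELETE", key, before, None))
        elif before != after:
            log.append(("MODIFY", key, before, after))
        # Unchanged records produce no log entry.
    return log


previous = {"K1": ("a",), "K2": ("b",), "K3": ("c",)}
current = {"K1": ("a",), "K2": ("B",), "K4": ("d",)}
change_log = diff_snapshots(previous, current)
```

Because unchanged records are omitted, the resulting log contains only the additions, deletions and modifications, each paired with its key, which is the subset a recipient would need to review or apply.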
The present invention is provided to address these and other issues.