This application generally relates to data processing techniques as performed in computer systems, and more specifically to data processing techniques for data integration and updates of databases in computer systems.
Generally, in a computer system, databases and other storehouses of information may need to be updated. At the database record level, such updates may be translated into one of three operations: inserting a new record, deleting an existing record, or updating an existing record. A general problem arises as to techniques for determining which records are subject to which operations. In determining which operations to perform, a determination generally must be made as to which records are considered as matching or equivalent. One technique may consider two entries as "matching" if a record included in the update list is an exact character match of a record in the database. For example, an exact match of a name, address and phone number may indicate a matching entry. A problem with this technique is that two records may in fact represent the same information or logical entity, and should therefore be considered as "matching", yet contain typographical errors or semantically equivalent variations which result in a matching failure when a character-by-character comparison, as just described, is performed. For example, a middle initial may be omitted from a person's name in one entry. In another update entry, the middle initial may be included. Although these entries identify the same person, a character-by-character comparison would fail to identify them as matching records.
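The difference between a character-by-character comparison and a comparison of semantic equivalents may be illustrated by the following minimal sketch. It is an illustration only and is not taken from the technique described herein: the field names (`name`, `address`, `phone`) and the normalization rules (lowercasing, stripping punctuation, dropping single-letter middle initials) are assumptions chosen for the example.

```python
import re

FIELDS = ("name", "address", "phone")  # assumed record layout


def exact_match(a: dict, b: dict) -> bool:
    """Character-by-character comparison of every field."""
    return all(a[f] == b[f] for f in FIELDS)


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop single-letter middle initials."""
    tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


def normalize_address(address: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", address.lower()).split())


def normalize_phone(phone: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


def normalized_match(a: dict, b: dict) -> bool:
    """Compare records after applying the assumed normalization rules."""
    return (
        normalize_name(a["name"]) == normalize_name(b["name"])
        and normalize_address(a["address"]) == normalize_address(b["address"])
        and normalize_phone(a["phone"]) == normalize_phone(b["phone"])
    )


existing = {"name": "John Q. Smith", "address": "12 Main St.", "phone": "(555) 010-1234"}
update = {"name": "john smith", "address": "12 Main St", "phone": "555-010-1234"}
```

With these two records, `exact_match(existing, update)` fails because of the middle initial, letter case and punctuation, while `normalized_match(existing, update)` treats the records as the same entity. Dropping middle initials is a deliberately coarse rule and could merge distinct persons; a practical system would likely weigh several fields before declaring a match.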
Another problem when considering which records are equivalent relates to the fact that update data may come from different sources. For example, if an existing record and the update records have the same source, a common set of unique identifiers may distinguish each record and may be used to detect matching entries. However, when the source of the existing database and the source of the update records differ, special matching techniques are required to determine equivalent records between the existing database and the update records.
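The two situations just described may be sketched as a two-tier lookup: an identifier lookup when both record sets share a source, and a lookup on a derived key otherwise. This is a hedged illustration only. The `id` field, the `same_source` flag and the `match_key` function are hypothetical names introduced for the example, and the derived key (normalized name plus phone digits) stands in for whatever special matching technique is actually used.

```python
import re


def match_key(rec: dict) -> tuple:
    """Hypothetical derived key: collapsed lowercase name plus phone digits."""
    name = " ".join(rec["name"].lower().split())
    phone = re.sub(r"\D", "", rec["phone"])
    return (name, phone)


def build_indexes(db_records: list) -> tuple:
    """Index the existing database by identifier and by derived key."""
    by_id = {r["id"]: r for r in db_records if r.get("id") is not None}
    by_key = {match_key(r): r for r in db_records}
    return by_id, by_key


def find_match(update: dict, by_id: dict, by_key: dict, same_source: bool):
    """Return the matching database record, or None if there is none."""
    # Same source: the shared unique identifier is authoritative.
    if same_source and update.get("id") in by_id:
        return by_id[update["id"]]
    # Different source: identifiers are not comparable, so fall back
    # to the derived key.
    return by_key.get(match_key(update))


db = [{"id": 1, "name": "Ann Lee", "phone": "555-0100"}]
by_id, by_key = build_indexes(db)
```

Here an update carrying `id` 1 from the same source matches by identifier even if its name is spelled differently, while an update from another source with an unrelated identifier can only be matched through the derived key.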
Thus, there is required a technique which efficiently updates an existing database by using various techniques to determine semantic equivalents of record entries which should be considered as matching. Further, various data processing techniques are needed to "clean up" data to be integrated into an existing database by eliminating duplicates and incorporating semantic equivalents as appropriate.
In accordance with principles of the invention is a method executed in a computer system for performing data integration. For each update record, a determination is made regarding a transaction classification with regard to a working database. Transactions are applied to an unfiltered version of the working database, in which the unfiltered database includes one or more records having unfiltered data. For each of said transactions, if the corresponding update record is an update or an insert transaction, data enhancements are performed on that update record, producing a filtered record. One or more filtered records are integrated into the working database. Post-processing is performed upon portions of the working database.
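The steps of the foregoing method may be sketched as follows. This is a minimal illustration under stated assumptions and not a definitive implementation: the in-memory dictionaries standing in for the working and unfiltered databases, the `deleted` flag used to signal a delete transaction, the `match_key` function, the particular enhancement (whitespace and capitalization cleanup) and the particular post-processing step are all hypothetical.

```python
def match_key(rec: dict) -> tuple:
    """Hypothetical matching key: collapsed lowercase name plus phone."""
    return (" ".join(rec["name"].lower().split()), rec["phone"])


def classify(update: dict, working: dict) -> str:
    """Classify an update record as a delete, update, or insert transaction."""
    if update.get("deleted"):  # assumed delete signal
        return "delete"
    return "update" if match_key(update) in working else "insert"


def enhance(rec: dict) -> dict:
    """Hypothetical data enhancement producing a filtered record."""
    filtered = dict(rec)
    filtered["name"] = " ".join(rec["name"].split()).title()
    return filtered


def post_process(working: dict) -> None:
    """Hypothetical post-processing: drop records left with an empty name."""
    for key in [k for k, r in working.items() if not r["name"]]:
        del working[key]


def integrate(updates: list, working: dict, unfiltered: dict) -> list:
    """Run the sketched pipeline and return the classification of each record."""
    kinds = []
    for rec in updates:
        kind = classify(rec, working)
        key = match_key(rec)
        kinds.append(kind)
        if kind == "delete":
            unfiltered.pop(key, None)
            working.pop(key, None)
            continue
        unfiltered[key] = rec        # transaction applied to unfiltered data
        working[key] = enhance(rec)  # filtered record integrated
    post_process(working)
    return kinds


working, unfiltered = {}, {}
updates = [
    {"name": "  jane   DOE", "phone": "555"},
    {"name": "Jane Doe", "phone": "555", "city": "Boston"},
    {"name": "Bob Ray", "phone": "777"},
    {"name": "bob ray", "phone": "777", "deleted": True},
]
kinds = integrate(updates, working, unfiltered)
```

In this sketch the unfiltered version retains each update record as received, while the working database holds only enhanced records, so that enhancements can be revised later without losing the original data.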
Thus, there is provided a technique which efficiently updates an existing database by using various techniques to determine semantic equivalents of various record entries which should be considered as matching, and performing data enhancements to the contents of the working database.