This application generally relates to data processing techniques as performed in computer systems, and more specifically to data processing techniques for data integration and updates of databases in computer systems.
Generally, in a computer system, databases and other stores of information may need to be updated. At the database record level, such updates may be translated into one of three operations: inserting a new record, deleting an existing record, or updating an existing record. A general problem arises as to techniques for determining which records are subject to which operations. In determining which operations to perform, a determination generally must be made as to which records are considered matching or equivalent. One technique may consider two entries as "matching" if there is an exact character match between a record included in the update list and a record in the database. For example, an exact match of a name, address and phone number may indicate a matching entry. A problem with this technique is that two records may in fact represent the same information or logical entity, and should therefore be considered "matching", yet typographical errors or semantically equivalent variations in the stored information may cause a character-by-character comparison, as just described, to fail. For example, a middle initial may be omitted from a person's name in one entry and included in another update entry. Although these entries identify the same person, a character-by-character comparison would fail to identify them as matching records.
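The middle-initial failure described above can be illustrated with a minimal sketch. The field names (`name`, `address`, `phone`), the helper `exact_match`, and the sample records are hypothetical and chosen only for illustration; they do not come from any particular database schema.

```python
def exact_match(a: dict, b: dict) -> bool:
    """Return True only if name, address and phone are identical character for character."""
    return all(a[key] == b[key] for key in ("name", "address", "phone"))


# Hypothetical records that describe the same person; only the middle initial differs.
existing = {"name": "John Q. Smith", "address": "12 Main St", "phone": "555-0100"}
update = {"name": "John Smith", "address": "12 Main St", "phone": "555-0100"}

same = exact_match(existing, existing)   # identical records compare equal
differs = exact_match(existing, update)  # the omitted initial defeats the comparison
```

Here `differs` is false even though both records refer to the same logical entity, which is the kind of missed match that motivates a more tolerant comparison.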
Another problem when considering which records are equivalent relates to the fact that update data may come from different sources. For example, if an existing record and the update records have the same source, a common set of unique identifiers may distinguish each record and be used to detect matching entries. However, when the source of the existing database and the source of the update records differ, such common identifiers may not be available, and special matching techniques are required to determine equivalent records between the existing database and the update records.
Thus there is required a technique which efficiently updates an existing database by using various techniques to determine semantic equivalents of various record entries which should be considered as matching. Further, various data processing techniques are needed to "clean up" data to be integrated into an existing database by eliminating duplicates and incorporating semantic equivalents as appropriate.
In accordance with principles of the invention is a method executed in a computer system for determining if a data update entry has a matching entry in an existing database. It is determined if an update entry includes a phone number that is toll-free. If the update entry includes a phone number that is toll-free, the method further includes: determining a subset of one or more existing entries in the existing database with a matching phone number; for each existing entry in the subset, calculating an associated score in accordance with the strength of the name match between the update entry and each existing entry; for each existing entry in the subset, updating the associated score if a zip code match between each existing entry and the update entry is determined; determining if there is at least one associated score greater than a predetermined threshold; and if there is only one existing entry in the subset with an associated score greater than the predetermined threshold, determining this existing entry matches the update entry.
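The toll-free matching steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the text does not say how name-match strength is computed, how much a zip code match adds to the score, or what the threshold is. The sketch assumes a `difflib.SequenceMatcher` similarity ratio for name strength, a fixed `ZIP_BONUS`, a `THRESHOLD` of 0.8, and North American toll-free area codes. All function names, field names, and sample records are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Assumptions for illustration only; none of these values appear in the text.
TOLL_FREE_PREFIXES = {"800", "888", "877", "866", "855", "844", "833"}
ZIP_BONUS = 0.2   # amount added to the score on a zip code match
THRESHOLD = 0.8   # the "predetermined threshold"


def digits(phone: str) -> str:
    """Keep only the digits, dropping a leading country code 1 if present."""
    d = "".join(ch for ch in phone if ch.isdigit())
    return d[1:] if len(d) == 11 and d.startswith("1") else d


def is_toll_free(phone: str) -> bool:
    return digits(phone)[:3] in TOLL_FREE_PREFIXES


def name_strength(a: str, b: str) -> float:
    """Assumed name-match measure: case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_toll_free(update: dict, existing: list) -> Optional[dict]:
    """Return the single existing entry matching the update entry, or None."""
    if not is_toll_free(update["phone"]):
        return None  # the non-toll-free case is outside this sketch
    # Subset of existing entries with a matching phone number.
    subset = [e for e in existing if digits(e["phone"]) == digits(update["phone"])]
    above = []
    for entry in subset:
        score = name_strength(update["name"], entry["name"])
        if entry["zip"] == update["zip"]:
            score += ZIP_BONUS  # update the score on a zip code match
        if score > THRESHOLD:
            above.append(entry)
    # A match is declared only when exactly one entry exceeds the threshold.
    return above[0] if len(above) == 1 else None


database = [
    {"id": 1, "name": "Acme Hardware Inc", "phone": "1-800-555-0100", "zip": "02139"},
    {"id": 2, "name": "Zenith Travel", "phone": "(800) 555-0100", "zip": "94105"},
    {"id": 3, "name": "Acme Hardware", "phone": "617-555-0199", "zip": "02139"},
]
update = {"name": "Acme Hardware", "phone": "800-555-0100", "zip": "02139"}
result = match_toll_free(update, database)
```

In this sketch, entries 1 and 2 share the toll-free number, but only entry 1 has a name close enough to the update entry to exceed the assumed threshold, so it is returned as the match. Returning `None` when several entries exceed the threshold is also an assumption, since the text only specifies the outcome when exactly one entry qualifies.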
Thus, there is provided a technique which efficiently updates an existing database by using various techniques to determine semantic equivalents of various record entries which should be considered as matching.