The present invention relates to computerized data storage and retrieval, and more particularly to techniques for determining whether stored data items should be linked or merged. More specifically, the present invention relates to the use of maximum entropy modeling to determine the probability that two different computer database records relate to the same person, entity, and/or transaction.
Computers store information about each of us in databases. For example, a computer may maintain a list of a company's customers in a customer database. When the company does business with a new customer, the customer's name, address and telephone number are added to the database. The information in the database is then used for keeping track of the customer's orders, sending out bills and newsletters to the customer, and the like.
Maintaining large databases can be difficult, time consuming and expensive. Duplicate records create an especially troublesome problem. Suppose for example that when a customer named "Joseph Smith" first starts doing business with an organization, his name is initially entered into the computer database as "Joe Smith". The next time he places an order, however, the sales clerk fails to notice or recognize that he is the same "Joe Smith" who is already in the database, and creates a new record under the name "Joseph Smith". A still further transaction might result in a still further record under the name "J. Smith." When the company sends out a mass mailing to all of its customers, Mr. Smith will receive three copies: one to "Joe Smith", another addressed to "Joseph Smith", and a third to "J. Smith." Mr. Smith may be annoyed at receiving several duplicate copies of the mailing, and the business has wasted money by needlessly printing and mailing duplicate copies.
It is possible to program a computer to eliminate records that are exact duplicates. However, in the example above, the records are not exact duplicates, but instead differ in certain respects. It is difficult for the computer to automatically determine whether the records are indeed duplicates. For example, the record for "J. Smith" might correspond to Joe Smith, or it might correspond to Joe's teenage daughter Jane Smith living at the same address. Jane Smith will never get her copy of the mailing if the computer is programmed to simply delete all but one "J. Smith." Data entry errors such as misspellings can cause even worse duplicate detection problems.
There are other situations in which different computer records need to be linked or matched up. For example, suppose that Mr. Smith has an automobile accident and files an insurance claim under his full name "Joseph Smith." Suppose he later files a second claim for another accident under the name "J. R. Smith." It would be helpful if a computer could automatically match up the two different claims records, helping to speed processing of the second claim, and also ensuring that Mr. Smith is not fraudulently attempting to get double recovery for the same accident.
Another significant database management problem relates to merging two databases into one. Suppose one company merges with another company and now wants to create a master customer database by merging together existing databases from each company. It may be that some customers of the first company were also customers of the second company. Some mechanism should be used to recognize that two records with common names or other data are actually for the same person or entity.
As illustrated above, records that are related to one another are not always identical. Due to inconsistencies in data entry or for other reasons, two records for the same person or transaction may actually appear to be quite different (e.g., "Joseph Braun" and "Joe Brown" may actually be the same person). Moreover, records that may appear to be nearly identical may actually be for entirely different people and/or transactions (e.g., Joe Smith and his daughter Jane). A computer programmed to simply look for near or exact identity will fail to recognize records that should be linked, and may try to link records that should not be linked.
One way to solve these problems is to have human analysts review and compare records and make decisions as to which records match and which ones don't. This is an extremely time-consuming and labor-intensive process, but in critical applications (e.g., the health professions) where errors cannot be tolerated, the high error rates of existing automatic techniques have been generally unacceptable. Therefore, further improvements are possible.
The present invention solves this problem by providing a method of training a system from examples that is capable of achieving very high accuracy by finding the optimal weighting of the different clues indicating whether two records should be matched or linked. The trained system provides three possible outputs when presented with two records: "yes" (i.e., the two records match and should be linked or merged); "no" (i.e., the two records do not match and should not be linked or merged); or "I don't know" (human intervention and decision making is required). Registry management can make informed effort versus accuracy judgments, and the system can be easily tuned for peculiarities in each database to improve accuracy.
In more detail, the present invention uses a statistical technique known as "maximum entropy modeling" to determine whether two records should be linked or matched. Briefly, given a set of pairs of records, each of which has been marked with a reasonably reliable "link" or "non-link" decision (the training data), the technique provided in accordance with the present invention builds a model using "Maximum Entropy Modeling" (or a similar technique) which will return, for a new pair of records, the probability that those two records should be linked. A high probability of linkage indicates that the pair should be linked. A low probability indicates that the pair should not be linked. Intermediate probabilities (i.e., pairs with probabilities close to 0.5) can be held for human review.
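The three-way outcome described above can be sketched as a simple thresholding step. This is only an illustration: the function name `decide` and the cut-off values 0.1 and 0.9 are assumptions chosen for the example, not values given in this description, and in practice they would be tuned to the desired trade-off between review effort and accuracy.

```python
def decide(link_probability, lower=0.1, upper=0.9):
    """Map a link probability to "link", "no-link" or "review".

    `lower` and `upper` are hypothetical cut-offs; pairs whose
    probability falls strictly between them are held for human review.
    """
    if not 0.0 <= link_probability <= 1.0:
        raise ValueError("link_probability must be between 0 and 1")
    if not lower <= upper:
        raise ValueError("lower must not exceed upper")
    if link_probability >= upper:
        return "link"
    if link_probability <= lower:
        return "no-link"
    return "review"
```

Widening the gap between the two cut-offs sends more pairs to human review and should reduce automatic errors; narrowing it does the opposite.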
In still more detail, the present invention provides a process for linking records in one or more databases whereby a predictive model is constructed by training said model using some machine learning method on a corpus of record pairs which have been marked by one or more persons with a decision as to that person's degree of certainty that the record pair should be linked. The predictive model may then be used to predict whether a further pair of records should be linked.
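As a rough illustration of training a predictive model on marked record pairs, the sketch below fits a two-outcome maximum entropy model, which has the same form as logistic regression. Several things here are assumptions made for the example and are not taken from this description: the training procedure (plain batch gradient ascent, swapped in for simplicity), the function names `train` and `predict`, the two binary features, and the tiny hand-made training set.

```python
import math

def predict(weights, bias, features):
    """Return the modeled probability that a record pair should be linked."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=1000, learning_rate=0.5):
    """Fit weights by batch gradient ascent on the log-likelihood.

    `examples` is a list of (feature_vector, label) pairs, where label
    is 1 for a pair marked "link" and 0 for a pair marked "non-link".
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        for i in range(n_features):
            weights[i] += learning_rate * grad_w[i] / len(examples)
        bias += learning_rate * grad_b / len(examples)
    return weights, bias

# Hypothetical features: [last names agree, birthdays agree].
training_pairs = [
    ([1, 1], 1), ([1, 1], 1), ([1, 1], 1),
    ([1, 0], 0), ([0, 1], 0), ([0, 0], 0),
]
weights, bias = train(training_pairs)
p_both_agree = predict(weights, bias, [1, 1])
p_name_only = predict(weights, bias, [1, 0])
```

With this toy data, a pair agreeing on both features should score above 0.5 and a pair agreeing on only one should score below it. A real corpus would use many more features and pairs.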
In accordance with another aspect of the invention, a process for linking records in one or more databases uses different factors to predict a link or non-link decision. These different factors are each assigned a weight. The equation Probability = L/(L+N) is formed, where L is the product of the weights of all features indicating link, and N is the product of the weights of all features indicating no-link. The calculated link probability is used to decide whether or not the records should be linked.
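The equation Probability = L/(L+N) can be worked through with a short sketch. It reads L and N as products of the weights assigned to the factors that fire for a given record pair; that reading, the function name `link_probability`, and the example weights are assumptions for illustration only.

```python
from math import prod

def link_probability(link_weights, no_link_weights):
    """Compute L / (L + N) from the weights of the active features.

    L is the product of the weights of features indicating link, and
    N is the product of the weights of features indicating no-link.
    Weights are assumed to be positive; an empty list has product 1.
    """
    if any(w <= 0 for w in list(link_weights) + list(no_link_weights)):
        raise ValueError("weights must be positive")
    L = prod(link_weights)
    N = prod(no_link_weights)
    return L / (L + N)

# Hypothetical pair: two features favor a link (weights 4.0 and 2.0)
# and one feature favors no-link (weight 2.0), so L = 8.0 and N = 2.0.
probability = link_probability([4.0, 2.0], [2.0])  # 8 / (8 + 2) = 0.8
```

When no features fire on either side, both products are 1 and the result is 0.5, which corresponds to having no evidence in either direction.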
In accordance with a further aspect provided by the invention, the predictive model for record linkage is constructed using the maximum entropy modeling technique and/or a machine learning technique.
In accordance with a further aspect provided by the invention, a computer system can automatically take action based on the link/no-link decision. For example, the two or more records can automatically be merged or linked together; or an informational display can be presented to a data entry person about to create a new record in the database.
The techniques provided in accordance with the present invention have potential applications in a wide variety of record linkage, matching and/or merging tasks, including for example:
Removal of duplicate records from an existing database ("De-duplication") such as by generating possible matches with database queries looking for matches on fields like first name, last name and/or birthday;
Fraud detection through the identification of health-care or governmental claims which appear to be submitted twice (the same individual receiving two Welfare checks or two claims being submitted for the same medical service);
The facilitation of the merging of multiple databases by identifying common records in the databases;
Techniques for linking records which do not indicate the same entity (for instance, linking mothers and daughters in health-care records for purposes of a health-care study); and
Accelerating data entry (e.g., automatic analysis at time of data entry to return the existing record most likely to match the new entry, thus reducing the potential for duplicate entries before they are entered, and saving data entry time by automatically calling up a likely matching record that is already in the system).
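For the de-duplication application listed above, the step of generating possible matches by looking for agreement on fields like last name and birthday can be sketched as follows. An in-memory dictionary stands in for the database query, and the function name `candidate_pairs`, the field names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key_fields=("last_name", "birthday")):
    """Return sorted pairs of record ids that agree on every key field.

    Values are compared after trimming whitespace and lowercasing.
    Records with the same normalized key land in the same bucket, and
    each bucket contributes all of its pairs as possible matches.
    """
    buckets = defaultdict(list)
    for record_id, record in records.items():
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        buckets[key].append(record_id)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)

# Hypothetical customer records keyed by record id.
customers = {
    1: {"first_name": "Joe", "last_name": "Smith", "birthday": "1960-01-02"},
    2: {"first_name": "Joseph", "last_name": "smith ", "birthday": "1960-01-02"},
    3: {"first_name": "Ann", "last_name": "Jones", "birthday": "1971-05-06"},
}
possible_matches = candidate_pairs(customers)
```

Each candidate pair would then be scored by the predictive model rather than merged outright. Note that this sketch buckets records with missing key fields together under an empty value, which a real system would likely want to handle separately.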