A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office public patent files or records, but otherwise reserves all copyright rights whatsoever.
This disclosure incorporates by reference the material on the compact disk labeled CD-ROM I, which consists of two copies (Copy 1 and Copy 2), each having the following files: source.code.appendix.09_345825.txt (83,523 bytes) and source.code.appendix.09_345825.wpd (156,003 bytes), the first in a text file format and the second in a WordPerfect file format, both in an IBM PC format for use in an MS Windows environment. This material is referred to herein as the Computer Program Listing Appendix.
1. Field of the Invention
The present invention relates to computer-based technology for linking or matching records in data files. In many cases it is important to link or match different records pertaining to the same individual. The matching and linking method and system of the present invention can operate as a totally automated matching system with improved match rates while reducing the number of false matches. Matching methods are used in many areas of education, business and commerce. In today's information age, it is important to be able to efficiently and accurately match data from a variety of sources while allowing various levels of accuracy.
2. Description of Related Art
Many existing matching methods and systems use what is called a weighting scheme, where points are awarded if some identification information on two data files matches. One type of weighting system might award the following points for various identification fields:
There are two major deficiencies with using this type of scheme. First, weighting schemes are not based on sound, statistically defensible criteria. For example, the inventors are unaware of any proof that social security number is twice as important as last name when matching data across multiple files. Second, the weighting scheme does not look at combinations of identification information or the interaction of the identification variables. Moreover, basic probability theory tells us that adding together the weights of fields that match tends to over-estimate the likelihood of a true match.
One preferred embodiment of the present invention uses a multi-stage probabilistic approach to matching students across program files. This multi-stage approach allows different matching or linking criteria to be used to produce potential matched pairs of student information for later evaluation. The first stage uses student social security number as a basis for matching students. Students not matched in the first stage are reevaluated in the next stage. One preferred embodiment of the present invention has a second stage of matching that uses a combination of last name and first name as a basis to match students and search for additional student matches across the two files. Once a potential match is found, the likelihood that it is a true match or link is evaluated using a probabilistic model. Additional stages, based on other identification fields, can be added in an iterative manner.
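The staged flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the Computer Program Listing Appendix: the record layout (dictionaries with "ssn", "last" and "first" keys), the function names, and the accept callback standing in for the probabilistic evaluation are all hypothetical. For brevity the sketch accepts the first candidate pair that passes, whereas the disclosure evaluates all candidate pairs and keeps the most probable one.

```python
from collections import defaultdict


def ssn_key(record):
    """Stage 1 key: social security number, or None when it is missing."""
    return record.get("ssn") or None


def name_key(record):
    """Stage 2 key: (last name, first name), compared case-insensitively."""
    last, first = record.get("last"), record.get("first")
    if not last or not first:
        return None
    return (last.strip().lower(), first.strip().lower())


def candidate_pairs(file1, file2, key_fn, open1, open2):
    """Pair up still-unmatched records from both files that share a key."""
    index = defaultdict(list)
    for j in sorted(open2):
        key = key_fn(file2[j])
        if key is not None:
            index[key].append(j)
    pairs = []
    for i in sorted(open1):
        key = key_fn(file1[i])
        if key is not None:
            pairs.extend((i, j) for j in index.get(key, []))
    return pairs


def multi_stage_match(file1, file2, stages, accept):
    """Run each stage in order; records matched in a stage leave later stages.

    `accept(rec1, rec2)` is a hypothetical stand-in for the probabilistic
    evaluation of a candidate pair.
    """
    open1 = set(range(len(file1)))
    open2 = set(range(len(file2)))
    matches = []
    for stage_name, key_fn in stages:
        for i, j in candidate_pairs(file1, file2, key_fn, open1, open2):
            if i in open1 and j in open2 and accept(file1[i], file2[j]):
                matches.append((stage_name, i, j))
                open1.discard(i)
                open2.discard(j)
    return matches
```

A further stage based on another identification field would be added by appending another (name, key function) entry to the stages list.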
A Bayesian approach was used to develop appropriate probability models. In one preferred embodiment of the present invention, seven identification fields (identifiers) were used in determining the probability that a matched pair of records is indeed the same student. Those fields are last name, first name, middle initial, social security number, date of birth, zip code, and gender. Based on a national sample of overlapping students from two sources, we determined the probability that students who are the same have information that matches and also the probability that their identification information does not match. Then we used two national samples that do not contain overlapping students to determine the probabilities that students who are not the same will have matching identification fields.
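One way such base probabilities could be estimated and combined is sketched below. The sketch assumes that fields agree or disagree independently of one another given the true match status (a naive Bayes simplification that the disclosure does not state and that ignores interactions between fields), and it assumes a prior probability that a candidate pair is a true match. All function and parameter names are hypothetical.

```python
def estimate_agreement_rates(pairs, fields):
    """Fraction of record pairs in which each field agrees.

    Applied to known same-student pairs this estimates P(agree | same);
    applied to known different-student pairs it estimates P(agree | different).
    """
    n = len(pairs)
    if n == 0:
        return {f: 0.0 for f in fields}
    counts = {f: 0 for f in fields}
    for a, b in pairs:
        for f in fields:
            if a.get(f) is not None and a.get(f) == b.get(f):
                counts[f] += 1
    return {f: counts[f] / n for f in fields}


def match_probability(agreement, m, u, prior):
    """Posterior probability of a true match via Bayes' rule.

    agreement: {field: True/False}, whether the field agrees on the pair
    m[field]:  P(field agrees | same student)
    u[field]:  P(field agrees | different students)
    prior:     assumed P(same student) before looking at the fields
    Assumes fields are conditionally independent (an illustrative assumption).
    """
    like_same = 1.0
    like_diff = 1.0
    for field, agrees in agreement.items():
        like_same *= m[field] if agrees else 1.0 - m[field]
        like_diff *= u[field] if agrees else 1.0 - u[field]
    numerator = prior * like_same
    denominator = numerator + (1.0 - prior) * like_diff
    return numerator / denominator if denominator > 0 else 0.0
```

Under these assumptions, a field that agrees raises the posterior in proportion to how much more often it agrees for true matches than for non-matches, and a disagreement lowers it correspondingly.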
When a potential match is found, these base probabilities are used to calculate the conditional probability that the matched records are the same student. Often, multiple matches will occur using a given identification string. For example, we may find 3 Jane Smiths in file 1 and 2 Jane Smiths in file 2. When this occurs, we calculate probabilities on all possible pairs of matches and then use the highest probability pair. All matched pairs of records must have a probability above a certain threshold to be considered a match.
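The selection rule for multiple candidates can be illustrated with a short sketch. The function name best_pair and the score_fn callback (any function returning the match probability for two records) are hypothetical and are not taken from the Computer Program Listing Appendix.

```python
from itertools import product


def best_pair(records1, records2, score_fn, threshold):
    """Score every possible pair and keep the single most probable one.

    Returns (index_in_file1, index_in_file2, probability), or None when
    there are no candidates or the best probability is not above the
    threshold.
    """
    best = None
    for (i, a), (j, b) in product(enumerate(records1), enumerate(records2)):
        p = score_fn(a, b)
        if best is None or p > best[2]:
            best = (i, j, p)
    if best is not None and best[2] > threshold:
        return best
    return None
```

With 3 candidates in file 1 and 2 in file 2, this evaluates all 6 pairs; ties are resolved in favor of the pair encountered first, which is a choice of this sketch rather than something the disclosure specifies.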
By adjusting this threshold level, we can increase our matching rate at the expense of more false matches, or decrease the matching rate to get a cleaner matched sample. In our trials we tried numerous threshold levels and evaluated the matched pairs that passed the threshold test for accuracy in matching. We also evaluated the matched pairs that failed the threshold test to see if we were inadvertently excluding students who were obvious matches.
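The trade-off between matching rate and false matches can be tabulated with a sketch like the following. It assumes each candidate pair has already been scored and reviewed, so that it carries a probability and a true/false label; the function name, the input layout, and the labels are hypothetical illustrations rather than the evaluation procedure actually used in the trials.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Summarize the effect of each threshold on reviewed candidate pairs.

    scored_pairs: iterable of (probability, is_true_match) tuples, where
                  is_true_match comes from a separate review of the pair
    thresholds:   iterable of threshold levels to try
    """
    scored_pairs = list(scored_pairs)
    rows = []
    for t in thresholds:
        accepted = [(p, ok) for p, ok in scored_pairs if p > t]
        rejected = [(p, ok) for p, ok in scored_pairs if p <= t]
        rows.append({
            "threshold": t,
            "accepted": len(accepted),
            # Pairs that passed the threshold but are not the same person.
            "false_matches": sum(1 for _, ok in accepted if not ok),
            # True matches that the threshold excluded.
            "missed_true": sum(1 for _, ok in rejected if ok),
        })
    return rows
```

Lowering the threshold in such a table raises the accepted count and, typically, the false-match count, while raising it increases the number of true matches that are excluded.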
This methodology is a great improvement because adjustments to the model are easy to implement and are statistically defensible. Matching different populations of people would only require adjustments to the program parameters, not the methodology or software, which is a significant advantage. This parameterization allows any two populations, regardless of program area or content, to be matched using only the inputted parameters. As with ACES, for which research analysis was done to calculate the initial Bayesian statistics, such analysis would need to be done to create those initial numbers prior to matching. This method requires no manual resolution or human intervention. It is PC-based, which helps keep costs down. Preliminary cleaning of data was also found to enhance the match.
Although preferred embodiments of the present invention are described below in detail, it is desired to emphasize that this is for the purpose of illustrating and describing the invention, and should not be considered as necessarily limiting the invention, it being understood that many modifications can be made by those skilled in the art while still practicing the invention claimed herein.