1. Technical Field
This invention pertains to the field of transaction data comparison. Specifically, the present invention pertains to the identification of duplicate data records where the contents of the data records may or may not be a perfect match.
2. Description of the Prior Art
Many matching methods have been devised. Most matching methods rely on a comparison of each position in a data field so that a match is declared when each position in the field is equal. These prior art matching methods are suitable for identifying duplicate records where the records are identical.
In most data processing systems used in the world today, data is stored in databases and once that data is so stored, it is seldom, if ever, reviewed by human analysts. This is especially true in the field of customer relationship management, where marketing databases are used to maintain information about potential customers, and in transaction processing, where records of transactions must be correlated to a known description of a customer. All too often, a new data source that describes the same individual or transaction is added to a database, so that the individual or transaction comes to be depicted by a plurality of records. The reason for this is that the records stored in the database may not be exactly identical. A record may be incomplete in that some of the fields in the record may be blank. A record can also include aliases that may be equivalent, but are not necessarily identical. In the matching mechanisms previously known, records that are not identical will not be identified as duplicates of each other.
The problem of identifying duplicates in a database is amplified by the fact that records in a database are normally comprised of a plurality of fields. In the prior art, each field in two records subject to comparison would need to be checked for “sameness”. If each field in two records were found to be identical, then the records could be flagged as duplicates of each other. This method would work except for the case where some fields in two records matched identically, but other fields may not match at all or they may contain aliases.
The present invention comprises a method and apparatus that allows for flexible comparison of a transaction record to a plurality of known data records. The present invention accepts transaction records from a computer application. The same computer application can specify a plurality of legacy records. These legacy records can exist in, for instance, a customer database. But the invention is not limited to use only in customer database applications.
Each of the legacy records is compared to the transaction record in sequence. Each comparison is conducted by examining each field in the transaction record and comparing the contents of that field to the contents of the corresponding field in one of the plurality of legacy records.
If a particular field in each of the two records being compared is found to be a match, then a positive accumulator is incremented. If the field is found to be a definitive mismatch, then a negative accumulator is incremented. A field that merely fails to match is not sufficient; it must affirmatively mismatch in order for the negative accumulator to be incremented. If neither a match nor a definitive mismatch is declared, then neither the positive nor the negative accumulator is incremented.
Once all of the fields in a record are compared, the positive accumulator is checked against a first positive threshold. Where the value of the positive accumulator exceeds the positive threshold, the records are deemed to be equivalents of each other. This means that a transaction record can be matched to a legacy record even though the two are not exactly identical.
If a particular record is not found to be an equivalent, then the value of the negative accumulator is compared to a second negative threshold. If the value of the negative accumulator exceeds that negative threshold, the two records are declared as non-equivalent.
A final check for equivalency is made by comparing the difference of the positive accumulator with the negative accumulator. If the difference between the positive and negative accumulators is greater than a third delta threshold, then the two records are declared to be equivalents of each other.
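By way of illustration only, the accumulator and threshold checks described above can be sketched as follows. The function name `records_equivalent`, the dictionary representation of a record, the three-valued comparator result (`True`, `False`, or `None` for no decision) and the unit increment are all assumptions made for this sketch. The text also does not state the outcome when the final delta check fails; the sketch assumes the records are then treated as non-equivalent.

```python
from typing import Callable, Dict, Optional

# Assumed convention: a field comparator returns True for a match,
# False for a definitive mismatch, and None when no decision is made.
Comparator = Callable[[str, str], Optional[bool]]


def records_equivalent(
    transaction: Dict[str, str],
    legacy: Dict[str, str],
    comparators: Dict[str, Comparator],
    positive_threshold: int,
    negative_threshold: int,
    delta_threshold: int,
) -> bool:
    """Apply the three threshold checks in the order the text gives them."""
    positive = 0
    negative = 0
    for field, compare in comparators.items():
        outcome = compare(transaction.get(field, ""), legacy.get(field, ""))
        if outcome is True:
            positive += 1  # unit increment assumed for this sketch
        elif outcome is False:
            negative += 1
        # None: neither accumulator changes

    if positive > positive_threshold:
        return True   # enough matching fields
    if negative > negative_threshold:
        return False  # too many definitive mismatches
    # Final check on the difference of the two accumulators.
    # Assumption: failing this check means "not equivalent".
    return (positive - negative) > delta_threshold


def exact_or_undecided(a: str, b: str) -> Optional[bool]:
    """Hypothetical comparator: undecided when either field is blank."""
    if not a.strip() or not b.strip():
        return None
    return a.strip() == b.strip()
```

A caller would supply one comparator per field name, so that different fields can use different comparison mechanisms.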
In all of these comparisons, the present invention supports a flexible means to program each of the thresholds, i.e. the positive threshold, the negative threshold and the delta threshold. The present invention also allows the increment value for each comparison to be programmed individually for each field in the records being compared. All of these programmable values are obtained from a configuration file.
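As a hedged illustration of how such programmable values might be supplied, the sketch below reads the three thresholds and the per-field increments from a JSON document. The text does not specify the format of the configuration file; the JSON layout, the key names, the field names and the default increment of 1 for unlisted fields are all assumptions.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical configuration contents; a real file format is not given in the text.
CONFIG_TEXT = """
{
  "positive_threshold": 8,
  "negative_threshold": 5,
  "delta_threshold": 3,
  "increments": {"last_name": 5, "first_name": 3, "zip": 2}
}
"""

REQUIRED_KEYS = ("positive_threshold", "negative_threshold",
                 "delta_threshold", "increments")


def load_config(text: str) -> dict:
    """Parse the configuration and verify that every programmable value is present."""
    config = json.loads(text)
    for key in REQUIRED_KEYS:
        if key not in config:
            raise KeyError(f"missing configuration value: {key}")
    return config


def accumulate(results: Dict[str, Optional[bool]],
               increments: Dict[str, int]) -> Tuple[int, int]:
    """Sum per-field increments into positive and negative accumulators.

    results maps a field name to True (match), False (mismatch) or None.
    """
    positive = negative = 0
    for field, outcome in results.items():
        step = increments.get(field, 1)  # assumed default for unlisted fields
        if outcome is True:
            positive += step
        elif outcome is False:
            negative += step
    return positive, negative
```

With per-field increments, a match on a highly discriminating field such as a surname can contribute more to the positive accumulator than a match on a less discriminating field.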
The comparison of each field can be performed in a number of ways. Each of the comparison mechanisms employed returns either a positive match indication or a negative mismatch indication.
The present invention comprises a rudimentary comparison mechanism that tests for sameness where both fields are non-blank. A refinement of the equality mechanism returns positive if one or both records contain blank fields. Yet another comparison mechanism removes blanks and punctuation characters and then compresses both fields before checking for equality.
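The three mechanisms just described might be sketched as below. The function names are hypothetical, "compressing" is assumed to mean closing up the field after the blanks and punctuation are removed, and the treatment of leading and trailing blanks is likewise an assumption.

```python
import string


def is_blank(value: str) -> bool:
    """A field is treated as blank when it holds only whitespace."""
    return value.strip() == ""


def equal_nonblank(a: str, b: str) -> bool:
    """Rudimentary test: match only when both fields are non-blank and the same."""
    return not is_blank(a) and not is_blank(b) and a.strip() == b.strip()


def equal_or_blank(a: str, b: str) -> bool:
    """Refinement: a blank field on either side also counts as a match."""
    return is_blank(a) or is_blank(b) or a.strip() == b.strip()


# Translation table that deletes ASCII punctuation and whitespace.
_REMOVE = str.maketrans("", "", string.punctuation + string.whitespace)


def compress(value: str) -> str:
    """Remove blanks and punctuation, leaving the remaining characters adjacent."""
    return value.translate(_REMOVE)


def equal_compressed(a: str, b: str) -> bool:
    """Compare the fields after blanks and punctuation have been removed."""
    return compress(a) == compress(b)
```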
A means for testing the equality of a gender code also comprises the present invention. In this means, both fields under scrutiny must contain a gender code and that code must be the same in each record for a match to occur. The invention also comprises a gender mechanism that also tests a first name appended to the gender code to determine equivalence of two records.
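A minimal sketch of the two gender mechanisms follows. The set of valid codes and the field layout for the second mechanism (a one-character code followed by the first name) are assumptions; the text does not define either.

```python
VALID_CODES = {"M", "F"}  # assumed code set, for illustration only


def gender_match(a: str, b: str) -> bool:
    """Match only when both fields hold a valid gender code and the codes agree."""
    a, b = a.strip().upper(), b.strip().upper()
    return a in VALID_CODES and a == b


def gender_name_match(a: str, b: str) -> bool:
    """Match on a gender code with a first name appended.

    Assumed layout: first character is the code, the remainder is the name.
    """
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return False
    code_a, name_a = a[0], a[1:].strip()
    code_b, name_b = b[0], b[1:].strip()
    return (code_a in VALID_CODES and code_a == code_b
            and name_a != "" and name_a == name_b)
```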
The present invention further comprises a mechanism that tests for numeric equivalency. In this mechanism, both fields are converted into a numeric equivalent and then those equivalents are compared for sameness.
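For illustration, numeric equivalency could be tested as sketched below, so that fields such as "007" and "7" compare as the same. The use of Python's `decimal.Decimal` for the conversion, and the decision to report a mismatch when either field cannot be converted, are assumptions of this sketch.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_number(value: str) -> Optional[Decimal]:
    """Convert a field to a number, or return None if it is not numeric."""
    try:
        number = Decimal(value.strip())
    except InvalidOperation:
        return None
    # Reject NaN and infinities, which Decimal would otherwise accept.
    return number if number.is_finite() else None


def numeric_match(a: str, b: str) -> bool:
    """Match when both fields convert to numbers and the numbers are equal."""
    na, nb = to_number(a), to_number(b)
    if na is None or nb is None:
        return False  # assumption: an unconvertible field is a mismatch
    return na == nb
```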
The present invention further comprises a comparison mechanism that removes blanks and special characters before compressing both fields and then testing for equality. This mechanism only checks the two fields to the extent of the shorter of the two fields.
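A sketch of this shortest-length comparison is given below. It assumes that "special characters" means anything other than letters and digits, and that two fields which are empty after compression do not match; neither point is settled by the text.

```python
def keep_alphanumeric(value: str) -> str:
    """Drop blanks and special characters, keeping only letters and digits."""
    return "".join(ch for ch in value if ch.isalnum())


def shortest_length_match(a: str, b: str) -> bool:
    """Compare the compressed fields over the length of the shorter one."""
    ca, cb = keep_alphanumeric(a), keep_alphanumeric(b)
    n = min(len(ca), len(cb))
    if n == 0:
        return False  # assumption: nothing left to compare is not a match
    return ca[:n] == cb[:n]
```

Under these assumptions, "JOHNSON" and "JOHN" match because only the first four characters are compared.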
A close alpha comparison mechanism also comprises the present invention. This comparison mechanism will remove blanks, punctuation and numeric characters before compressing both fields and then comparing those fields. In this close alpha comparison mechanism, one transposition will not be fatal to the comparison and the letters “E” and “O” are equal.
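The close alpha comparison might be sketched as follows. The sketch assumes that a "transposition" is a swap of two adjacent characters, that the comparison is made on upper-cased letters, and that the equivalence of "E" and "O" can be modeled by folding one onto the other before comparing. The function names are hypothetical.

```python
def close_alpha_key(value: str) -> str:
    """Keep letters only, upper-case them, and fold O onto E."""
    letters = "".join(ch for ch in value.upper() if ch.isalpha())
    return letters.replace("O", "E")


def within_one_transposition(a: str, b: str) -> bool:
    """True when a equals b, or differs by one swap of adjacent characters."""
    if a == b:
        return True
    if len(a) != len(b):
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]]
            and a[diffs[1]] == b[diffs[0]])


def close_alpha_match(a: str, b: str) -> bool:
    """Close alpha comparison under the assumptions stated above."""
    return within_one_transposition(close_alpha_key(a), close_alpha_key(b))
```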
The present invention further comprises an alternative equality comparison mechanism where not only are equal fields found as positive matches, but any field compared to a blank field will also return a positive match. Yet another comparison mechanism comprising the present invention is a compressed mode comparison in which all blanks and punctuation characters are removed and both fields compressed before comparing for equality.
In cases where a numeric comparison is required, the present invention comprises a single transposition tolerant equality mechanism. The invention also comprises a right justified numeric comparison that isolates numeric characters and then compares the right most numeric characters in each of two fields from two records being compared.
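The right justified numeric comparison, for example, could be sketched as below. The text does not say how many of the rightmost digits are compared; this sketch assumes the length of the shorter digit string unless an explicit `width` (a hypothetical parameter) is supplied.

```python
from typing import Optional


def digits_only(value: str) -> str:
    """Isolate the numeric characters of a field."""
    return "".join(ch for ch in value if ch in "0123456789")


def right_justified_numeric_match(a: str, b: str,
                                  width: Optional[int] = None) -> bool:
    """Compare the rightmost digits of two fields.

    width is the number of rightmost digits to compare; when omitted,
    the length of the shorter digit string is used (an assumption).
    """
    da, db = digits_only(a), digits_only(b)
    n = min(len(da), len(db)) if width is None else width
    if n <= 0 or len(da) < n or len(db) < n:
        return False
    return da[-n:] == db[-n:]
```

Under these assumptions, a telephone number recorded with an area code matches the same number recorded without one.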
A unique comparison mechanism that is tolerant of one character transposition, and that returns a match when the two fields from the two records match or when both fields are blank, also comprises the present invention. This mechanism will return a negative (mismatch) condition if a non-blank field is compared to a blank field.
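The blank handling of this mechanism can be illustrated with the following sketch. As before, a transposition is assumed to mean a swap of two adjacent characters, and the function names are hypothetical.

```python
def is_blank(value: str) -> bool:
    """A field is treated as blank when it holds only whitespace."""
    return value.strip() == ""


def one_transposition_apart(a: str, b: str) -> bool:
    """True when swapping one adjacent pair of characters in a yields b."""
    if len(a) != len(b):
        return False
    for i in range(len(a) - 1):
        if a[i] != b[i]:
            # Swap the pair at the first difference and compare the result.
            swapped = a[:i] + a[i + 1] + a[i] + a[i + 2:]
            return swapped == b
    return False


def transposition_or_blank_match(a: str, b: str) -> bool:
    """Match on equality, one transposition, or two blank fields."""
    if is_blank(a) and is_blank(b):
        return True   # both blank: match
    if is_blank(a) or is_blank(b):
        return False  # blank against non-blank: mismatch
    a, b = a.strip(), b.strip()
    return a == b or one_transposition_apart(a, b)
```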
The present invention further comprises a not-equal comparison mechanism. The not-equal comparison mechanism will return a match indication where the fields being compared do not match.
Yet another comparison mechanism comprising the present embodiment checks to determine if two records contain nicknames of each other. This mechanism returns a positive match where one record contains a name or an initial and the second record contains either that name or initial or a nickname for the name found in the first record. This mechanism is embodied in an enhanced version that includes a gender code appended to the name and/or initial.
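A minimal sketch of the nickname mechanism, without the gender code enhancement, is shown below. The nickname table is a small hypothetical sample, not data from the invention, and the rule that a single initial matches any name beginning with that letter is an assumption about how initials are treated.

```python
# Hypothetical sample table mapping a given name to some of its nicknames.
NICKNAMES = {
    "WILLIAM": {"BILL", "BILLY", "WILL"},
    "ROBERT": {"BOB", "BOBBY", "ROB"},
    "ELIZABETH": {"BETH", "BETTY", "LIZ"},
}


def nickname_match(a: str, b: str) -> bool:
    """Match on the same name, a compatible initial, or a listed nickname."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return False
    if a == b:
        return True
    if len(a) == 1 or len(b) == 1:
        # Assumption: an initial matches a name with the same first letter.
        return a[0] == b[0]
    # Either field may hold the full name while the other holds the nickname.
    return b in NICKNAMES.get(a, set()) or a in NICKNAMES.get(b, set())
```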
The present invention further comprises a means to compare two corresponding fields from two records in a phonetic and in a reverse phonetic manner. The present invention also comprises an alphanumeric comparison mechanism that allows for one transposition error.
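The text does not name the phonetic algorithm employed. Purely as an illustration, the sketch below substitutes the well-known American Soundex code as a stand-in, and it assumes that a "reverse phonetic" comparison applies the same code to each field read from right to left. Both choices are assumptions of this sketch, not statements about the invention.

```python
# Soundex digit assigned to each consonant group.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code, or '' if no letters."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code
    return (first + "".join(digits) + "000")[:4]


def phonetic_match(a: str, b: str) -> bool:
    """Match when both fields yield the same non-empty code."""
    code = soundex(a)
    return code != "" and code == soundex(b)


def reverse_phonetic_match(a: str, b: str) -> bool:
    """Same comparison applied to the fields read right to left (assumed meaning)."""
    return phonetic_match(a[::-1], b[::-1])
```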