As the world becomes increasingly computerized, databases storing electronic information have become increasingly significant. Most corporate, government and other entities make extensive use of database technology. For example, databases are used to keep track of your driver's license and car registration information, your medical records, your financial and banking information, your telephone number and address, and much other information.
Some databases are huge. Maintaining large databases can be difficult, time consuming and expensive. Duplicate records create an especially troublesome problem. Suppose for example that when a customer named “Joseph Smith” first starts doing business with an organization, his name is initially inputted into the computer database as “Joe Smith”. The next time he places an order, however, the sales clerk fails to notice or recognize that he is the same “Joe Smith” who is already in the database, and creates a new record under the name “Joseph Smith”. A still further transaction might result in a still further record under the name “J. Smith.” When the company sends out a mass mailing to all of its customers, Mr. Smith will receive three copies—one to “Joe Smith”, another addressed to “Joseph Smith”, and a third to “J. Smith.” Mr. Smith may be annoyed at receiving several duplicate copies of the mailing, and the business has wasted money by needlessly printing and mailing duplicate copies.
It is possible to program a computer to eliminate records that are exact duplicates. In the example above, however, the records are not exact duplicates, but instead differ in certain respects. It is difficult for the computer to automatically determine whether the records are indeed duplicates. For example, the record for “J. Smith” might correspond to Joe Smith, or it might correspond to Joe's teenage daughter Jane Smith living at the same address. Jane Smith will never get her copy of the mailing if the computer is programmed to simply delete all but one “J. Smith.” Data entry errors such as misspellings can cause even worse duplicate detection problems.
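For illustration only, the distinction above can be sketched in a few lines of code. The records and field values here are hypothetical; the point is simply that removing byte-for-byte identical records is easy, while near-duplicates such as “Joseph Smith” and “J. Smith” survive untouched.

```python
# Hypothetical example data: only the byte-for-byte duplicate is removed;
# variant spellings of the same person are kept.
records = [
    ("Joe Smith", "12 Oak St"),
    ("Joe Smith", "12 Oak St"),      # exact duplicate -- removed
    ("Joseph Smith", "12 Oak St"),   # same person, different spelling -- kept
    ("J. Smith", "12 Oak St"),       # could be Joe, or his daughter Jane -- kept
]

def drop_exact_duplicates(records):
    """Keep the first occurrence of each exactly identical record."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

deduped = drop_exact_duplicates(records)
```

Deciding whether the three surviving variants refer to one person or several is the harder, approximate-matching problem the remainder of this section addresses.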
There are other situations in which different computer records need to be linked or matched up. For example, suppose that Mr. Smith has an automobile accident and files an insurance claim under his full name “Joseph Smith.” Suppose he later files a second claim for another accident under the name “J. R. Smith.” It would be helpful if a computer could automatically match up the two different claims records—helping to speed processing of the second claim, and also ensuring that Mr. Smith is not fraudulently attempting to get double recovery for the same accident.
Large databases create special problems in terms of efficient computing. It can take minutes, hours or sometimes even days to perform complex processes on large databases. It is generally desirable to reduce the amount of time required to perform such processing. This places a premium on more efficient database processing techniques.
One way to increase the efficiency of database and other record matching processing is to introduce a so-called “blocking” step. In the field of approximate record matching, a “blocking” step generally refers to a fast matching algorithm used primarily as the first step of a larger record matching system. The goal of a blocking step is generally to find all possible matches to an input query record; it is not ordinarily to determine with precision which record is the correct match. Blocking thus aims for maximum “recall” or “sensitivity”, possibly at the expense of high “precision” or “specificity”. In effect, blocking is a sort of “is it in the ballpark?” test that can be used to narrow down the number of records that must be processed by a higher-precision but more computationally intensive (or even manual) subsequent matching test.
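The “ballpark” idea can be sketched as a simple candidate index. The blocking key below (lowercase initial of the last name) and the record fields are assumptions chosen for illustration; any cheap, coarse key would serve the same role.

```python
from collections import defaultdict

def blocking_key(record):
    """A deliberately coarse key: lowercase initial of the last name."""
    return record["last"][0].lower()

def build_index(records):
    """Map each blocking key to the list of records sharing that key."""
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec)].append(rec)
    return index

# Hypothetical database
records = [
    {"first": "Joe", "last": "Smith"},
    {"first": "Joseph", "last": "Smith"},
    {"first": "Jane", "last": "Smith"},
    {"first": "Mary", "last": "Jones"},
]
index = build_index(records)

# A query retrieves only its block: all three Smiths (high recall),
# Jones excluded, leaving a higher-precision matcher to decide which
# candidate, if any, is the true match.
candidates = index[blocking_key({"first": "J. R.", "last": "Smith"})]
```

The key is intentionally liberal: it may pull in unrelated Smiths, but it should never exclude the true match, consistent with blocking's recall-over-precision goal.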
Blocking algorithms have important applications in the field of approximate record matching. For example, they can be used to identify database records that might represent the same physical entity. These records can then be manually reviewed, or they could be automatically declared a match if the user does not require great accuracy. Another application (perhaps the most commonly used one) is to use a blocking algorithm as a first stage of a more accurate and computationally expensive record matching process. In this instance, in the second stage, one may use some matching technique to determine which record in the database is the best match to the “query record” being sought in the database. The second-stage matching algorithm then generally tests every record returned by the blocking algorithm against the query record to see if they match.
The initial blocking step is very useful because even with an extremely fast matching algorithm, when de-duplicating a database of n records where n is large, it would generally be time-prohibitive and ineffective for the system to attempt to examine all (n*(n−1))/2 pairs of records in the database. Record matching systems therefore often use a preliminary “blocking” step to reduce the number of pairs of records that the second-stage matching algorithm (SSMA) has to examine.
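The quadratic growth of the pair count is easy to make concrete. The database size below is an arbitrary example.

```python
def total_pairs(n):
    """Number of distinct record pairs in a database of n records."""
    return n * (n - 1) // 2

# Even a modestly sized database is infeasible to compare exhaustively:
# one million records yield roughly half a trillion pairs.
pairs = total_pairs(1_000_000)
```

At even a million pair comparisons per second, exhaustively examining those pairs would take nearly six days, which is why reducing the candidate pair count via blocking matters so much.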
Traditional blocking methods are generally based on an ad hoc judgment of the usefulness of different fields in a matching decision. For instance, a healthcare site might use Medicaid and medical record number matches as blocking characteristics—meaning that any records matching on those two fields would be passed on to the second-stage matching algorithm. Also commonly used are matches on birthday and first name, birthday and last name, birthday and Soundex code of last name, etc.
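A sketch of one such traditional key follows: birthday combined with the Soundex code of the last name. The record field names are hypothetical; the Soundex routine implements the standard American Soundex rules, under which variant spellings such as “Smith” and “Smythe” receive the same code and therefore fall into the same block.

```python
def soundex(name):
    """American Soundex: first letter plus three digits, e.g. 'S530'."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate duplicate codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, allowing repeats across them
    return (result + "000")[:4]

def blocking_key(record):
    """Hypothetical traditional key: birthday plus Soundex of last name."""
    return (record["birthday"], soundex(record["last_name"]))

a = {"birthday": "1970-03-01", "last_name": "Smith"}
b = {"birthday": "1970-03-01", "last_name": "Smythe"}
# a and b share a blocking key, so the pair would be passed on to the
# second-stage matching algorithm despite the spelling difference.
```

Note that such a key still misses pairs whose birthday was mis-keyed, which is one source of the false negatives discussed below.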
This traditional approach can work reasonably well, but its ad hoc nature limits the portability of any system built around it. It also tends to generate too many false negatives (i.e., records that should have been linked, but were not). The quality of the blocking routine is important to minimizing false negatives, since pairs that the blocking phase does not identify as possible matches will be missed even if the second-stage matching algorithm's decision-making engine would have assigned them a high match probability or score. At the same time, the system must carefully manage tradeoffs between false negatives and run-time performance: if the blocking algorithm is too liberal in passing along hypothetical matches, system run-time may exceed the user's tolerance.
We provide an automated blocking technique that can be used, for example, as a first step to find approximate matches in a database. Exemplary illustrative non-limiting implementations of the technique build a blocking set that is as liberal as possible in efficiently retrieving records that match on individual fields or sets of fields, while avoiding selection criteria that are predicted to return more than a specified maximum number of records. The ability to do blocking at low cost without extensive manual setup is highly advantageous in many situations, including but not limited to those where a machine learning based or other second-stage matching algorithm is being used.
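The capped blocking idea described above can be sketched as follows. This is an illustrative approximation only, not the actual patented algorithm: the database, field names, cap value and the naive brute-force count predictor are all assumptions. The sketch tries single-field criteria first (the most liberal), then pairs of fields, and keeps only criteria predicted to return no more than a specified maximum number of records.

```python
from itertools import combinations

MAX_RECORDS = 2  # toy cap for illustration; real systems would use a larger value

def predicted_count(db, criterion):
    """Predict how many records match a criterion (a dict of field -> value).
    Shown here as a brute-force count; a real system would use an index
    or precomputed statistics."""
    return sum(all(rec.get(f) == v for f, v in criterion.items()) for rec in db)

def build_blocking_set(db, query, fields):
    """Select the most liberal criteria whose predicted result size
    is positive but does not exceed MAX_RECORDS."""
    selected = []
    for r in (1, 2):  # individual fields first, then pairs of fields
        for combo in combinations(fields, r):
            criterion = {f: query[f] for f in combo}
            if 0 < predicted_count(db, criterion) <= MAX_RECORDS:
                selected.append(criterion)
    return selected

# Hypothetical database and query record
db = [
    {"first": "Joe", "last": "Smith", "city": "Springfield"},
    {"first": "Joseph", "last": "Smith", "city": "Springfield"},
    {"first": "Jane", "last": "Smith", "city": "Springfield"},
    {"first": "Mary", "last": "Jones", "city": "Springfield"},
]
query = {"first": "Joe", "last": "Smith", "city": "Springfield"}
blocking = build_blocking_set(db, query, ("first", "last", "city"))
# "last" alone and "city" alone are rejected as over-cap; "first" alone
# and the field pairs containing "first" pass.
```

Under this sketch the blocking set stays liberal (single fields are preferred) without ever selecting a criterion predicted to flood the second-stage matcher.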