Generally, blocking can be the first phase of record matching, as shown in FIG. 1. Blocking attempts to collect similar records. It takes the name ‘blocking’ because a group of similar records is called a ‘block’ of records.
Blocking is used to find approximately matching records in large data sets because the brute-force approach of comparing each record with every other record would take too long (the number of comparisons in the brute-force approach grows as the square of the number of records in the data set).
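By way of non-limiting illustration, the quadratic growth noted above can be made concrete with a short calculation. The function names and the example figures (one million records, blocks of 100 records each) are assumptions chosen only for illustration and do not come from the system described herein.

```python
def brute_force_pairs(n: int) -> int:
    """Number of distinct record pairs when every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(block_sizes) -> int:
    """Number of pairs compared when comparisons are made only within each block."""
    return sum(brute_force_pairs(size) for size in block_sizes)


# Hypothetical data set: 1,000,000 records.
print(brute_force_pairs(1_000_000))      # 499999500000

# Hypothetical blocking result: 10,000 blocks of 100 records each.
print(blocked_pairs([100] * 10_000))     # 49500000
```

Under these assumed figures, restricting comparisons to records within the same block reduces the pair count by roughly four orders of magnitude.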
FIG. 1 shows batch or “offline” blocking reading records from two data sources for the purpose of matching records in one data source against records in the other data source. As discussed below, batch blocking can also be run against a single data source to identify duplicate records in that source. Blocking outputs sets of possibly matching records, where the size of each set is limited by a configuration parameter. A more detailed, more computationally expensive matching process may then analyze all pairs of records within each set of possibly matching records. We call this matching process the Second Stage Matching Algorithm, which we abbreviate “SSMA.”
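The flow described above (group similar records, cap the size of each set, then hand every pair within a set to a later matching stage) can be sketched as follows for the single-source case. This is a minimal sketch under stated assumptions, not the blocking algorithm of the technology herein: the blocking key, the field names `last` and `zip`, the parameter name `max_block_size`, and the choice to simply skip oversized groups are all hypothetical. Skipping an oversized group would lose its potential matches, so a practical system would need some other way of handling it.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func, max_block_size):
    """Group records by a blocking key and keep groups within the size limit.

    Assumption: groups larger than max_block_size are skipped here purely
    to keep the sketch short.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec)
    # A group of one record has nothing to be compared against.
    return [g for g in groups.values() if 2 <= len(g) <= max_block_size]


def candidate_pairs(blocks):
    """Yield every pair of records within each block for second-stage matching."""
    for block in blocks:
        yield from combinations(block, 2)


# Hypothetical records and blocking key (first two letters of surname + ZIP code).
records = [
    {"id": 1, "last": "Smith", "zip": "10001"},
    {"id": 2, "last": "Smyth", "zip": "10001"},
    {"id": 3, "last": "Smith", "zip": "10001"},
    {"id": 4, "last": "Jones", "zip": "94105"},
]


def key(rec):
    return (rec["last"][:2].lower(), rec["zip"])


blocks = block_records(records, key, max_block_size=3)
pair_ids = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print(pair_ids)  # [(1, 2), (1, 3), (2, 3)]
```

Only the three pairs inside the single surviving block would be passed on for detailed comparison; record 4 shares a key with no other record and is never compared.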
Generally speaking, the goal of blocking is to find as many potentially matching records as possible while not returning so many potential matches that the speed of downstream data retrieval and scoring suffers. More formally, blocking minimizes the number of missed matches (false negatives) while limiting the number of potential matches it returns.
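The tradeoff stated above can be quantified when a set of known true matches is available for evaluation. The following is a minimal sketch; the availability of such labeled pairs, the function name `blocking_quality`, and the reported fields are assumptions introduced for illustration only.

```python
def blocking_quality(candidates, true_matches, n_records):
    """Summarize a blocking run: missed matches versus candidate volume.

    candidates   -- iterable of (id_a, id_b) pairs returned by blocking
    true_matches -- iterable of (id_a, id_b) pairs known to match (assumed labels)
    n_records    -- total number of records in the data set
    """
    # frozenset makes (1, 2) and (2, 1) compare as the same pair.
    cand = {frozenset(p) for p in candidates}
    truth = {frozenset(p) for p in true_matches}
    total_pairs = n_records * (n_records - 1) // 2
    return {
        "missed_matches": len(truth - cand),          # false negatives
        "candidate_pairs": len(cand),                 # work passed downstream
        "fraction_of_all_pairs": len(cand) / total_pairs,
    }


# Hypothetical example: 5 records, 3 candidate pairs, 2 known true matches.
result = blocking_quality(
    candidates=[(1, 2), (1, 3), (2, 3)],
    true_matches=[(3, 1), (4, 5)],
    n_records=5,
)
print(result)
# {'missed_matches': 1, 'candidate_pairs': 3, 'fraction_of_all_pairs': 0.3}
```

A lower `missed_matches` count generally comes at the cost of a higher `candidate_pairs` count, which is the balance blocking is meant to strike.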
The technology herein provides new techniques for batch, or offline, blocking. These techniques take a set of records and generate sets, or blocks, of potentially matching records for the entire set. The blocks of potential matches are then passed to the SSMA to evaluate which records match.
Exemplary Non-Limiting Features:
Fully customizable for any data
Requires very little user customization: Just define what fields to use for blocking and run it
User can easily specify their preferred tradeoff between faster performance on the one hand and achieving a very low rate of missed matches on the other hand
Requires no special knowledge of the database
Works across subject-matter domains: people, companies, products, financial securities, etc.
Does not require a relational database. Works on, among others, flat file, XML, and relational database inputs
Can make use of multiple machines to speed processing
Exemplary Non-Limiting Benefits:
High Speed. Perform fast record matching between large databases or between a moderate-size input dataset and a large database.
Accuracy. Get results that will mimic an expert's decisions.
Flexibility. Match on any subject matter, including people, financial securities, companies, and products.
Auditability. Simple fundamental algorithm is easy to describe to clients, enabling transparency of the system's decisions.
Match any kind of data.
Build systems customized to particular matching needs.
Make optimum business decisions with more reliable and valid data.
Remove duplicate records from databases to assure high quality data. This provides benefits for public health registries; district, state, or federal K-12 school enrollment databases; communicable disease surveillance systems; voter registration rolls; and many other applications.
Link databases to facilitate data analysis. This has applications for marketing (link a database of business prospects with a database purchased from another company), counter-terrorism (link airline passengers with a list of possible terrorists), and many other fields.