1. Field of the Invention
The present invention relates to a computer program product, system, and method for grouping records in buckets distributed across nodes in a distributed database system to perform comparison of the grouped records.
2. Description of the Related Art
To compare data records in a database to determine a relationship value of the records, the database server may have to pair-wise compare each possible pair of records. For large scale databases, such a comparison operation is computationally expensive, because the number of possible pairs grows quadratically with the number of records, and may require a substantial amount of computing resources to calculate the results in a timely fashion.
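By way of illustration only, the cost of exhaustive pair-wise comparison may be sketched as follows. The function name `pair_count` and the sample record identifiers are hypothetical and are not taken from any particular system; the sketch merely shows that n records yield n(n-1)/2 distinct pairs.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical record identifiers, used only to enumerate the pairs explicitly.
records = ["r1", "r2", "r3", "r4"]
all_pairs = list(combinations(records, 2))

# The enumerated pairs agree with the closed-form count.
assert len(all_pairs) == pair_count(len(records))

# One million records would require roughly half a trillion comparisons.
large_scale_pairs = pair_count(1_000_000)
```

Under these assumptions, a data set of one million records yields 499,999,500,000 pairs, which illustrates why exhaustive comparison does not scale to large databases.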
Prior art includes a candidate selection technique in which candidate records are preprocessed and analyzed in order to place each record into zero to n bucket groups. Once the buckets have been identified, the records associated with each individual bucket are pair-wise compared against each other using a probabilistic matching algorithm to determine a match score for each pair. The data that is used during the detailed comparison step is referred to as the comparison data. In existing probabilistic matching systems, the candidate comparison processor accesses a centralized repository, such as a database or file system, to retrieve the candidate record comparison data for the records that belong to the bucket being processed, which creates a bottleneck at the repository.
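The candidate selection technique described above may be sketched, in simplified form, as follows. This is a minimal illustration under stated assumptions and not the method of any particular prior art system: the bucketing rules in `bucket_keys` (a name prefix and a postal code), the record fields, and the use of a string-similarity ratio in `match_score` as a stand-in for a probabilistic matching algorithm are all hypothetical. The sketch holds all records in memory and therefore does not model the retrieval of comparison data from a centralized repository.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def bucket_keys(record):
    """Hypothetical candidate selection: derive zero or more bucket keys."""
    keys = []
    if record.get("name"):
        keys.append("name:" + record["name"][:3].lower())
    if record.get("zip"):
        keys.append("zip:" + record["zip"])
    return keys


def match_score(a, b):
    """Stand-in for a probabilistic matching algorithm (name similarity only)."""
    return SequenceMatcher(
        None, a.get("name", "").lower(), b.get("name", "").lower()
    ).ratio()


def compare_buckets(records):
    """Group record ids into buckets, then score each pair sharing a bucket."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        for key in bucket_keys(rec):
            buckets[key].append(rec_id)

    scores = {}
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            # A pair may share several buckets; score it only once.
            if (a, b) not in scores:
                scores[(a, b)] = match_score(records[a], records[b])
    return dict(buckets), scores


# Hypothetical sample data; record 4 yields no keys and lands in zero buckets.
records = {
    1: {"name": "Jon Smith", "zip": "10001"},
    2: {"name": "John Smith", "zip": "10001"},
    3: {"name": "Jane Doe", "zip": "94105"},
    4: {"name": "", "zip": ""},
}
buckets, scores = compare_buckets(records)
```

In this sample, only records 1 and 2 share a bucket, so a single pair is scored rather than all six possible pairs among the four records.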
There is a need in the art for improved techniques to cross compare large data sets.