Record linkage is the problem of finding pairs or sets of records in a database that represent the same entity. For a large database that does not fit entirely into random-access memory, comparing all possible pairs of records requires many database reads to bring the records to be compared into memory. This can be an inefficient and time-consuming operation.
In previously considered techniques, each database read loads into memory only those records that are to be compared with one another, such as the records that share the same blocking key value. Such methods have several disadvantages. One disadvantage is that the number of blocks is large, so the number of required database reads is correspondingly large. Another disadvantage is that block sizes can vary widely. Small blocks leave most of the available memory unused on each read, which wastes memory resources. Blocks that are too large to fit in memory lead to out-of-memory errors.
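The per-block loading pattern described above can be illustrated with a minimal sketch. It is not taken from any particular system: the table layout, the `block_key` column, the sample rows, and the function names `build_demo_db` and `candidate_pairs_per_block` are all hypothetical, and an in-memory SQLite database stands in for a database that would in practice be too large for memory.

```python
import sqlite3
from itertools import combinations


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, block_key TEXT)"
    )
    rows = [
        (1, "Ann Smith", "SMITH"),
        (2, "Anne Smith", "SMITH"),
        (3, "A. Smith", "SMITH"),
        (4, "Bob Jones", "JONES"),
        (5, "Carla Diaz", "DIAZ"),
    ]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    return conn


def candidate_pairs_per_block(conn):
    """Issue one database read per blocking key value and pair records within it."""
    keys = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT block_key FROM records ORDER BY block_key"
        )
    ]
    reads = 0          # number of block-loading reads issued
    block_sizes = {}   # blocking key value -> number of records loaded
    pairs = []         # candidate pairs of record ids to compare
    for key in keys:
        # One read per block: only records sharing this key are loaded.
        block = conn.execute(
            "SELECT id FROM records WHERE block_key = ? ORDER BY id", (key,)
        ).fetchall()
        reads += 1
        ids = [row[0] for row in block]
        block_sizes[key] = len(ids)
        pairs.extend(combinations(ids, 2))
    return pairs, reads, block_sizes


if __name__ == "__main__":
    pairs, reads, block_sizes = candidate_pairs_per_block(build_demo_db())
    print(pairs, reads, block_sizes)
```

In this toy data set, three reads are issued for five records, and two of those reads load a single record that yields no comparisons. The same loop applied to a block larger than available memory would fail at the `fetchall()` call, which is the out-of-memory case noted above.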
Accordingly, it is desirable to optimize database access for record linkage.