Data reconciliation is a process by which two data sources share information about the data records that they have, so that they can identify missing or changed data records. Typically, one data store (i.e. a source system) is considered authoritative and the other data store (i.e. a destination system) is considered a copy.
One simple process for checking that all of the data records from a source system are present at a destination system involves assigning each data record a unique identifier (“ID”). A join operation may then be performed between a table of data records from the source system and a table of data records from the destination system. Any ID from the source system that has no joined row following the join operation is considered to identify a missing record.
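The ID-based check described above can be illustrated with a minimal sketch. It assumes that records are small in-memory dictionaries with a hypothetical "id" field, and the function names left_join_on_id and missing_ids are illustrative only. It performs the join by brute force and is not intended to represent any particular system's implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record shape: a dict carrying at least an "id" key.
Record = Dict[str, object]


def left_join_on_id(
    source: List[Record], destination: List[Record]
) -> List[Tuple[Record, Optional[Record]]]:
    """Pair each source record with the destination record that has the
    same ID, or with None when the destination has no such record."""
    joined = []
    for src in source:
        match = None
        # Brute-force scan of the destination table for a matching ID.
        for dst in destination:
            if dst["id"] == src["id"]:
                match = dst
                break
        joined.append((src, match))
    return joined


def missing_ids(source: List[Record], destination: List[Record]) -> List[object]:
    """Return the IDs of source records that have no joined row."""
    return [src["id"] for src, dst in left_join_on_id(source, destination) if dst is None]


source_rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 3, "value": "c"},
]
destination_rows = [
    {"id": 1, "value": "a"},
    {"id": 3, "value": "c"},
]

print(missing_ids(source_rows, destination_rows))  # [2]
```

In this sketch the record with ID 2 exists only at the source, so it is reported as missing at the destination.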
A join operation is typically computationally expensive. Assume, for example, that a source system has T1 data records and that a destination system has T2 data records. Joining the two tables containing these records by brute force takes time proportional to T1*T2 and memory proportional to T1+T2. If the IDs are comparable, then sorting the records from the source system (i.e. T1 data records) and sorting the records from the destination system (i.e. T2 data records) reduces the time to T1*log(T1)+T2*log(T2), but still uses T1+T2 memory. Moreover, once data completeness has been established, another operation may be performed to assess data integrity, which takes linear time.
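The sort-based alternative can be sketched as follows. The sketch assumes that each record is a hypothetical (ID, payload) tuple, that IDs are unique and comparable, and that both tables fit in memory. The function name reconcile_sorted is illustrative only. For brevity, it folds the completeness check and the integrity check into a single linear merge pass after sorting, which is a choice made for this example rather than a required ordering.

```python
from typing import List, Tuple

# Hypothetical record shape: (id, payload), with unique, comparable IDs.
Row = Tuple[int, str]


def reconcile_sorted(
    source: List[Row], destination: List[Row]
) -> Tuple[List[int], List[int]]:
    """Return (missing_ids, changed_ids) relative to the source table."""
    src = sorted(source, key=lambda row: row[0])       # about T1*log(T1) comparisons
    dst = sorted(destination, key=lambda row: row[0])  # about T2*log(T2) comparisons

    missing: List[int] = []
    changed: List[int] = []
    i = j = 0
    # Single linear pass over both sorted tables.
    while i < len(src):
        if j >= len(dst) or src[i][0] < dst[j][0]:
            # No destination record carries this source ID.
            missing.append(src[i][0])
            i += 1
        elif src[i][0] > dst[j][0]:
            # Destination-only record; skipped in this sketch.
            j += 1
        else:
            # Same ID on both sides: compare payloads for integrity.
            if src[i][1] != dst[j][1]:
                changed.append(src[i][0])
            i += 1
            j += 1
    return missing, changed


source_rows = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
destination_rows = [(4, "D"), (1, "a"), (5, "e"), (3, "c")]

print(reconcile_sorted(source_rows, destination_rows))  # ([2], [4])
```

Both sorted copies are held in memory at once, which corresponds to the T1+T2 memory cost noted above.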
At a very large scale, previous approaches to data completeness and data integrity, such as the simple process described above, may be suboptimal because both T1 and T2 can be on the order of tens or possibly even hundreds of billions of data records. It is with respect to these and other considerations that the disclosure made herein is presented.