Aspects of the disclosure relate to computer hardware and software. In particular, one or more aspects of the disclosure generally relate to calculating hash sums on databases implemented using parallel system architectures.
In modern commerce, customer information is maintained in a diverse array of database formats. Certain database formats may be preferred for initial intake and processing of customer data, while other formats may be more suitable for long-term storage of the customer data. Still other formats may be better suited for analyzing the customer data. With very large databases, it may be preferable to utilize a database having a parallel system architecture. Such databases may include a plurality of nodes that each store a portion of the database. Many such parallel system architectures exist, with commercial options including parallel file systems such as Apache Hadoop and parallel databases such as Teradata Database.
It is frequently necessary to copy data from one database to another. For example, data may be transferred from an originating source or database to a data warehousing database in a process known as Extract, Transform, and Load (ETL). It is desirable to be able to confirm that the content of one database was accurately transferred to the other database. Best practices and/or regulations may require that operators confirm that data loaded into a data warehouse matches the data that was received from a customer. Some techniques for comparing databases include comparing checksums, record counts, byte counts, or column sums. However, each of these techniques fails to guarantee that the content of two databases is byte-for-byte identical, since different contents can produce the same checksum, count, or sum. An additional problem is presented by disparate types of parallel systems, as existing techniques may require that the data in one parallel system be transferred to a single node, sorted, and processed further before a comparison can be made. This approach has numerous disadvantages, as it introduces a bottleneck at the single node and discards the many advantages provided by parallel systems.
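The limitation of the comparison techniques described above may be illustrated with a minimal sketch. The sample rows, the `encode` serialization, and all function names below are hypothetical and are not taken from the disclosure. The sketch assumes two copies of a small table in which two values were swapped during transfer. The record count, byte count, column sum, and a simple additive checksum all agree, while a SHA-256 hash computed over the sorted rows differs. The sorted hash shown here corresponds to the single-node approach characterized above as a bottleneck and is included only for contrast.

```python
import hashlib

# Hypothetical sample data: two amounts were swapped during the transfer.
source = [("alice", 100), ("bob", 250)]
target = [("alice", 250), ("bob", 100)]


def encode(row):
    """Serialize one row to bytes (an assumed, illustrative format)."""
    name, amount = row
    return f"{name}|{amount}\n".encode("utf-8")


def record_count(rows):
    return len(rows)


def byte_count(rows):
    return sum(len(encode(row)) for row in rows)


def column_sum(rows):
    return sum(amount for _, amount in rows)


def additive_checksum(rows):
    # Sum of all byte values modulo 2**16; insensitive to where bytes occur.
    return sum(sum(encode(row)) for row in rows) % 65536


def sorted_hash(rows):
    # Gather every row in one place, sort, then hash. Sorting makes the
    # result independent of row order but requires all rows on one node.
    digest = hashlib.sha256()
    for line in sorted(encode(row) for row in rows):
        digest.update(line)
    return digest.hexdigest()


counts_match = record_count(source) == record_count(target)
bytes_match = byte_count(source) == byte_count(target)
sums_match = column_sum(source) == column_sum(target)
checksums_match = additive_checksum(source) == additive_checksum(target)
hashes_match = sorted_hash(source) == sorted_hash(target)

print(counts_match, bytes_match, sums_match, checksums_match, hashes_match)
```

In this sketch the first four comparisons report a match even though the tables differ, and only the hash comparison reports a mismatch. A cryptographic hash does not strictly prove equality either, but it makes an undetected difference highly unlikely.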