The present invention relates to data replication, and particularly to methods and systems employed in database management systems for comparing data contained in a source database table against data replicated in a target database table corresponding to the source database table, and for identifying any differences.
Data replication is a common practice by which an enterprise ensures continuous data availability through data redundancy. From a technology perspective, there are disk-based replication methodologies as well as middle-tier, software-based replication methodologies. In terms of replication protocol, replication may be synchronous or asynchronous.
In asynchronous replication, the data is replicated after the transaction that originated the data changes is committed; hence it does not impact transaction performance at the source site. To validate that the data has been replicated completely and accurately, especially when asynchronous replication is applied, a data comparison utility is often used. For databases, a comparison is performed to ensure that the data entries in the source and the (replication) target are consistent (matching) in terms of the number of records for each key value and the value of each record column.
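The consistency criterion just described (the same number of records per key value, and matching values in each column) can be illustrated with a minimal sketch. It assumes rows are available in memory as Python dictionaries; the names `group_by_key` and `compare_rows` are hypothetical and are not taken from any particular comparison utility.

```python
from collections import defaultdict


def group_by_key(rows, key_columns):
    """Group rows by the tuple of their key column values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        groups[key].append(row)
    return groups


def compare_rows(source_rows, target_rows, key_columns):
    """Report count differences per key and value differences per column."""
    source = group_by_key(source_rows, key_columns)
    target = group_by_key(target_rows, key_columns)
    differences = []
    for key in sorted(set(source) | set(target)):
        source_group = source.get(key, [])
        target_group = target.get(key, [])
        # First criterion: same number of records for each key value.
        if len(source_group) != len(target_group):
            differences.append((key, "count", len(source_group), len(target_group)))
            continue
        # Second criterion: every column of every record matches.
        for source_row, target_row in zip(source_group, target_group):
            for column in source_row:
                if source_row[column] != target_row.get(column):
                    differences.append(
                        (key, column, source_row[column], target_row.get(column))
                    )
    return differences


source_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
target_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "x"},
]
differences = compare_rows(source_table, target_table, ["id"])
```

In this illustration, key 2 is reported for a differing `name` column and key 3 for a differing record count. A comparison of this kind requires every row from both tables to be available in one place.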
When a database table becomes very large and the two databases are physically remote from each other, the comparison can be very expensive because of the cost of fetching data from the tables and sending rows from one database to the other for comparison. To reduce the amount of data transferred, a checksum of a record or of multiple records may be transferred instead of the records themselves. Row-by-row comparison is then used only when the checksums fail to match.
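The checksum-first approach can be sketched as follows. This is an illustrative example only: it assumes each row has a unique key, uses SHA-256 from the Python standard library as the checksum, and adopts an arbitrary canonical encoding of a row. The function names are hypothetical.

```python
import hashlib


def row_checksum(row):
    """Checksum of one row, computed over a canonical encoding (an assumption)."""
    encoded = "\x1f".join(f"{column}={row[column]!r}" for column in sorted(row))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def checksums_by_key(rows, key_column):
    """Computed locally at each site; only this mapping crosses the network."""
    return {row[key_column]: row_checksum(row) for row in rows}


def keys_needing_row_compare(source_sums, target_sums):
    """Keys whose checksums differ or that exist on only one side."""
    all_keys = set(source_sums) | set(target_sums)
    return sorted(
        key for key in all_keys if source_sums.get(key) != target_sums.get(key)
    )


source_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
target_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "x"},
]
suspect_keys = keys_needing_row_compare(
    checksums_by_key(source_table, "id"),
    checksums_by_key(target_table, "id"),
)
```

Only the rows identified by `suspect_keys` (here keys 2 and 3) would need to be fetched and compared column by column. Every row must still be read and hashed at both sites, so the saving is in network transfer rather than in table I/O.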
Further, to improve performance, parallel comparison can be used, in which the checksums of data blocks (each containing multiple rows) are compared for validation. However, parallel comparison improves only the elapsed time of the comparison. It does not reduce the amount of work, and hence can still be I/O intensive, network intensive, and CPU intensive.
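A minimal sketch of parallel block-checksum comparison is given below. It assumes integer keys, blocks formed from fixed key ranges, and a thread pool standing in for the parallel workers of a real utility; the function names and the block size are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def block_checksum(rows):
    """One checksum covering every row of a block."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(sorted(row.items())).encode("utf-8"))
    return digest.hexdigest()


def blocks_by_key_range(rows, key_column, block_size):
    """Assign each row to a block by integer key range (an assumption)."""
    blocks = {}
    for row in sorted(rows, key=lambda r: r[key_column]):
        blocks.setdefault(row[key_column] // block_size, []).append(row)
    return blocks


def mismatched_blocks(source_rows, target_rows, key_column, block_size=100, workers=4):
    """Return the ids of blocks whose source and target checksums differ."""
    source = blocks_by_key_range(source_rows, key_column, block_size)
    target = blocks_by_key_range(target_rows, key_column, block_size)
    block_ids = sorted(set(source) | set(target))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Blocks are hashed concurrently, but every row is still read and hashed.
        source_sums = list(
            pool.map(lambda b: block_checksum(source.get(b, [])), block_ids)
        )
        target_sums = list(
            pool.map(lambda b: block_checksum(target.get(b, [])), block_ids)
        )
    return [
        block_id
        for block_id, s, t in zip(block_ids, source_sums, target_sums)
        if s != t
    ]


source_table = [{"id": i, "name": f"row{i}"} for i in range(6)]
target_table = [dict(row) for row in source_table]
target_table[4]["name"] = "changed"
suspect_blocks = mismatched_blocks(source_table, target_table, "id", block_size=2)
```

In this illustration only the block covering keys 4 and 5 is flagged for row-level comparison. Adding workers shortens the elapsed time, but the total number of rows fetched and hashed is unchanged, which is the limitation noted above.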
For many customers, it is highly desirable to reduce the cost of table difference comparison. This includes both the resources consumed by the comparison and the time spent performing it. In addition, the volume of data might be extremely large. In such cases, the comparison must be performed in a manner that does not overwhelm system resources.