1. Technical Field
The present invention generally relates to distributed data processing methods. More particularly, the present invention relates to methods for auditing the integrity of data being replicated on high availability computer systems.
2. Related Art
Business processes have become so intertwined with information technology (IT) as to make them inseparable. The flow of accurate, real-time information is critical to the success of modern businesses, and its availability to users of all types is considered a significant advantage in highly competitive markets.
Earlier centralized computing environments processed modest amounts of batch input and typically produced reports aggregating small chunks of information into meaningful results. Processing was managed as a result of sequential input and output, i.e., single-threading, and was fairly small by contemporary standards. As the demand for information grew, there were increased demands on processing capabilities. The centralized computing environments evolved into central system complexes, paving the way for multiple processes running in parallel, i.e., multi-threading. Thereafter, rudimentary interactive processing through communication monitors was developed, ushering in the transaction-processing requirements associated with most businesses.
Demand for information access increased still further as users gained additional power in the processing and manipulation of data, leading to the client/server topology, browser-based technology, and the like. Client/server describes the relationship between two computer applications in which one application, the “client,” makes a service request to another application, the “server,” which fulfills the request. The client/server relationship model can be implemented by applications in a single computer, such as where one program acts as a “client” and interfaces with the user, while another program acts as a “server” by providing data requested by the “client.” An example of such a configuration is the X Window System. The relationship model may be expanded to networked environments, where it defines an efficient way to interconnect applications and data distributed across distant locations.
While there are significant advantages in structuring data processing network systems according to the client/server model, one well-recognized concern is that the server is a single point of failure. Despite the improved reliability of individual hardware and software components, anything from a minor process failure to a system-wide crash results in interruptions to the data and services provided by the server, also known as downtime. Additionally, the network interconnecting the client and the server could experience problems, leading to further downtime. With global commerce being conducted across multiple countries and time zones simultaneously over the Internet, there is an increasing need to maintain operations and maximize uptime of computer systems that support such commercial transactions. Accordingly, there is no idle time in which to back up or verify static data, as has been done with traditional CRC technology. The system-wide elimination of such single points of failure is a key element of high availability computer systems. To that end, multiple servers, which are also referred to in the art as “nodes,” were organized in clusters. A number of different clustering methodologies were developed, each offering varying degrees of high availability. In an Active/Active cluster, traffic intended for a failed node is either passed on to an existing node or load balanced across the remaining nodes. An Active/Passive cluster provides a fully redundant instance of each node, with a passive node taking over for the active node only upon failure of the active node. Further, an N+1 type cluster provides a single extra node that is brought online to take over the role of the failed node. These high availability systems require that all data stored on the production or primary node be mirrored on the backup node. This ensures that any data stored on the primary node is available for retrieval on the backup node in case the primary node fails.
The simplest method is to periodically copy all of the data on the production node to the backup node. As will be appreciated, however, this is deficient for high-availability systems because there is a lag between the backup operations. For example, any modification to data before failure but after the last backup operation is lost, and any restored version from the backup node will not reflect the modifications. Furthermore, this method can require significant network bandwidth and data processing resources due to the potentially large volumes of data, and may decrease the life of the physical storage device. These problems can be alleviated to some extent with the use of an incremental backup or synchronization method where only those data files that have been changed since the previous backup are copied from the primary node to the backup node. Typically, when a file is modified, only a small portion of the file is actually changed from the previous version. While an incremental backup or synchronization can reduce network bandwidth and save storage space compared to a complete backup or synchronization, it is still inefficient in that a complete file is transferred even though it is possible that only a small portion of the file was actually modified.
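By way of illustration only, the incremental approach described above may be sketched as follows. The function names, the use of file modification times to detect changes, and the flat directory layout are assumptions made for this example and are not drawn from any particular backup product. The sketch copies each changed file in its entirety, which is the inefficiency noted above.

```python
import os
import shutil
import tempfile


def incremental_backup(primary_dir, backup_dir, last_backup_time):
    """Copy only files modified after last_backup_time; return the copied names.

    Assumption: a flat directory, with modification time as the change signal.
    """
    copied = []
    for name in sorted(os.listdir(primary_dir)):
        src = os.path.join(primary_dir, name)
        if os.path.isfile(src) and os.path.getmtime(src) > last_backup_time:
            # The complete file is transferred, even if only one byte changed.
            shutil.copy2(src, os.path.join(backup_dir, name))
            copied.append(name)
    return copied


def demo_incremental_backup():
    """Create one stale file and one fresh file, then back up changes since t=2000."""
    with tempfile.TemporaryDirectory() as primary, \
            tempfile.TemporaryDirectory() as backup:
        for name, mtime in (("old.dat", 1000), ("new.dat", 3000)):
            path = os.path.join(primary, name)
            with open(path, "w") as handle:
                handle.write(name)
            # Force a known modification time so the example is deterministic.
            os.utime(path, (mtime, mtime))
        return incremental_backup(primary, backup, last_backup_time=2000)
```

In this sketch, any modification made to the primary node after the recorded backup time remains unprotected until the next run, which is the lag discussed above.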
As an improvement to incremental backups or synchronizations, there are backup processes that identify the differences between two versions of a file, and attempt to copy only those differences. This can further reduce network bandwidth and storage requirements because only portions of the file are transmitted between the primary node and the backup node. One deficiency with this backup method is the heavy processing power necessary for deriving the differences between the files, particularly with large files.
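A difference-based transfer of this kind may be sketched, under simplifying assumptions, as a comparison of fixed-size blocks. The block size, the function names, and the delta format below are hypothetical and chosen only for this example. The sketch assumes that both versions are available to the side computing the delta, and it does not handle insertions that shift the bytes that follow.

```python
BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def compute_delta(old: bytes, new: bytes, block_size: int = BLOCK_SIZE):
    """Return (new_length, [(offset, block), ...]) listing the blocks that differ."""
    changed = []
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        # Every block of the file is examined, which is the processing cost.
        if old[offset:offset + block_size] != block:
            changed.append((offset, block))
    return len(new), changed


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new version on the backup from the old version and the delta."""
    new_length, changed = delta
    # Truncate or zero-pad the old version to the new length, then patch blocks.
    buf = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for offset, block in changed:
        buf[offset:offset + len(block)] = block
    return bytes(buf)
```

For a twelve-byte file in which a single byte changes, only the one four-byte block containing that byte would be transmitted, although every block must still be compared to find it.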
Such deficiencies are particularly evident in relational databases, since the database management system stores all of the records thereof in a single file. Database management systems such as DB2 developed by IBM Corporation of Armonk, N.Y. organize records into particular fields, tables, and databases (a set of multiple tables). Typically, separate files are not generated for each table or field.
The aforementioned conventional backup and synchronization methods are insufficient for high availability database applications. Specifically, high volumes of data must be replicated, and modifications are constantly being made thereto. Thus, there is a possibility that at any given moment, the primary node and the backup node do not have identical data. An alternative to incremental backups and similar file-level copying of the files that represent the data structures of the respective databases is a journaling system incorporated into the database. The journaling system generates a log of each operation upon the primary node database, such as changing the value of a record, deleting the record, adding a new record, and so forth, and transmits that log to the backup node database. The backup node processes this log and performs each logged operation on its own database, resulting in up-to-date, identical databases between the primary node and the backup node. Thus, the backup node is ready to take over operations immediately upon failure of the primary node, and all data on the backup node is identical to the data on the failed primary node.
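The journaling approach described above may be illustrated with the following minimal sketch. The class and function names, the in-memory dictionary standing in for a database, and the three operation names are assumptions made for this example and do not represent the journaling facility of any actual database management system.

```python
def apply_entry(db, entry):
    """Apply one journal entry, a tuple of (operation, key, value), to a database."""
    op, key, value = entry
    if op in ("add", "update"):
        db[key] = value
    elif op == "delete":
        db.pop(key, None)
    else:
        raise ValueError("unknown journal operation: %r" % (op,))


class PrimaryNode:
    """Hypothetical primary node that logs every operation it performs."""

    def __init__(self):
        self.db = {}
        self.journal = []

    def execute(self, op, key, value=None):
        entry = (op, key, value)
        apply_entry(self.db, entry)
        self.journal.append(entry)  # log the operation for the backup node


def replicate(primary, backup_db, applied_count):
    """Replay journal entries not yet applied on the backup; return the new count."""
    for entry in primary.journal[applied_count:]:
        apply_entry(backup_db, entry)
    return len(primary.journal)
```

In this sketch only the operations themselves cross the network, rather than the file containing the database, and the backup tracks how many entries it has already applied so that each is performed exactly once.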
While replication of data from the primary node to the backup node in this fashion is generally reliable, there may be instances where errors are introduced during transmission, or where there are race conditions between the primary and backup nodes, otherwise known as collisions. Inadvertent or intentional user access to the backup node without updating the primary node is another source of data corruption. In response, Cyclic Redundancy Check (CRC) processes have been utilized to detect errors in the data transmitted to the backup node. However, the source of errors is not limited to those occurring during transmission, so there is a need in the art for a system to continuously monitor the integrity of all data stored on a backup node database, and to flag and/or repair any errors upon discovery. Furthermore, because such integrity checking operations are time and resource intensive, there is a need for monitoring the integrity of the primary node while it is fully operational, that is, while updates are being made to the primary node database that is being replicated on the backup node database. Additionally, there is a need for seamlessly incorporating data validity checking in any high availability replication environment.
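For purposes of illustration only, a CRC-based comparison of records on the two nodes may be sketched as follows. The sketch uses the `zlib.crc32` function of the Python standard library; the function names, the use of string-valued records keyed by string identifiers, and the in-memory dictionaries are assumptions made for this example. The sketch compares static snapshots and therefore assumes that no updates are in flight, which is the limitation identified above.

```python
import zlib


def record_crc(value: str) -> int:
    """Return the CRC-32 checksum of one record's contents."""
    return zlib.crc32(value.encode("utf-8"))


def audit(primary_db, backup_db):
    """Return the sorted keys whose records are missing on one node or differ by CRC."""
    mismatched = []
    for key in sorted(set(primary_db) | set(backup_db)):
        if key not in primary_db or key not in backup_db:
            # Record exists on only one node, e.g. added directly to the backup.
            mismatched.append(key)
        elif record_crc(primary_db[key]) != record_crc(backup_db[key]):
            # Record exists on both nodes, but the contents do not match.
            mismatched.append(key)
    return mismatched
```

A comparison of this kind can flag a record that was altered directly on the backup node, a source of error that a transmission-only check would not detect.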