1. Field of the Invention
The present invention relates generally to database management and, more particularly, to a system and method for identifying and recovering from database errors.
2. Related Art
Increasingly, users who rely on critical data stored in a distributed and replicated database require that such data be highly available. That is, the database must be available continuously regardless of hardware failures, network failures, software failures, or scheduled software and hardware maintenance. Generally only infrequent short periods of unavailability, totaling a few minutes to perhaps a few hours per year, are tolerable.
Unfortunately, a number of events can, and often do, occur which prevent conventional databases from being highly available. For instance, database errors due to programming mistakes, faulty database administration procedures and improper user input can render conventional databases unavailable. For example, an application or process may erroneously write data into an inappropriate area of the storage medium where the database is stored. An error may also result from synchronization procedures that reconcile data between different file servers when the communication pathways used introduce errors into the data being passed between replicas of the database. There are also errors which are generally non-deterministic. Such errors include, for example, resource exhaustion, hardware failures in, for example, the database server, as well as a corrupted database or corrupted data structures within a database server process.
Typically, the occurrence of database errors is initially brought to the administrator's attention when there is inconsistent behavior or degradation of database performance. When this occurs, the database administrator may implement a number of conventional approaches to recover from the database error and return the damaged database to some previous state.
Many conventional database systems provide transaction-based recovery techniques to recover from database errors. These techniques enable a database to be restored to a consistent state after hardware or software failures that do not corrupt on-disk data. With archived transaction logs and periodic dumps, the database system can also recover from media failures or data corruption. Although no committed transactions are lost, transaction-based recovery often prevents the database system from being highly available. Each period of unavailability may range from minutes to hours or even days depending upon the cause.
Some database management systems include log-based replication tools that replicate data in one database into one or more replicas. This is accomplished by reading committed transactions from a transaction log of the database in which the update is made, and performing the same updates in all of the replicas in the network. Depending upon the vendor and configuration options, the updates are either always made at a primary site and then propagated to the replicas, or the updates are made at any site and propagated to all other sites. To achieve high availability, these systems often employ a primary/standby-primary scheme, where the standby primary is a replica that becomes the primary in the event of the failure of the original primary. There are, however, several problems with such an approach.
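The log-based replication described above may be illustrated by the following minimal sketch. It is not taken from any particular vendor's product; the names LogRecord, Database and propagate are hypothetical and are introduced only for illustration, and the primary-site configuration is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    # Hypothetical log entry: one update made by one transaction.
    txn_id: int
    key: str
    value: str
    committed: bool


@dataclass
class Database:
    # Hypothetical in-memory database with an append-only transaction log.
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, txn_id, key, value, committed=True):
        """Record the update in the log; apply it locally only if committed."""
        self.log.append(LogRecord(txn_id, key, value, committed))
        if committed:
            self.data[key] = value


def propagate(primary, replicas, start=0):
    """Read the primary's log from position `start` and perform every
    committed update on each replica. Returns the new log position."""
    for record in primary.log[start:]:
        if record.committed:
            for replica in replicas:
                replica.data[record.key] = record.value
    return len(primary.log)


primary = Database()
replica_a, replica_b = Database(), Database()

primary.write(1, "user:1", "alice")
primary.write(2, "user:2", "bob", committed=False)  # uncommitted, not replicated
position = propagate(primary, [replica_a, replica_b])
```

In this sketch the replicas receive only the committed update, and the returned log position marks where the next propagation pass would resume.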
One problem is that the primary and the replica are only loosely synchronized. The state of the replica always lags the state of the primary by some unpredictable number of committed transactions. In the event of a primary failure or network failure, these committed transactions will be located only at the primary and will be unavailable to the standby primary. When the standby primary becomes accessible, it will lack these transactions. Accordingly, such transactions will not be visible to users connected to the standby primary. When the old primary comes back on-line, it is generally brought up to date by providing it with all of the updates that occurred while it was unavailable. However, the transactions located only at the original primary may be inconsistent with the state of the standby primary. There may be no indication of an inconsistency, or the inconsistencies may show up as errors in a log file produced during resynchronization and subsequently be addressed by conflict resolution procedures defined by default rules or user-specified actions.
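The replication lag described above may be illustrated by the following minimal sketch. The tuple layout (transaction id, key, value) and the function names unreplicated and find_conflicts are assumptions made only for illustration. Comparing keys is a deliberately simple stand-in for conflict detection and does not represent the conflict resolution procedures of any actual system.

```python
def unreplicated(primary_log, applied_count):
    """Committed transactions present only at the failed primary:
    everything after the last log position the standby had applied."""
    return primary_log[applied_count:]


def find_conflicts(stranded, standby_updates):
    """Stranded primary transactions that touch a key the standby
    also updated while it was acting as primary."""
    standby_keys = {key for _, key, _ in standby_updates}
    return [txn for txn in stranded if txn[1] in standby_keys]


# Each entry is (txn_id, key, value); all entries are committed at the primary.
primary_log = [(1, "a", "1"), (2, "b", "2"), (3, "c", "3")]

# The standby had applied only the first transaction when the primary failed.
stranded = unreplicated(primary_log, applied_count=1)

# Updates accepted by the standby after it took over as primary.
standby_updates = [(101, "b", "9")]

conflicts = find_conflicts(stranded, standby_updates)
```

Here transactions 2 and 3 are invisible to users of the standby primary, and transaction 2 is inconsistent with the standby's later update to the same key when the old primary returns.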
Generally, the administrator may also invoke a conventional database repair tool, such as DSRepair (DSRepair is a registered trademark of Novell, Inc., Provo, Utah). DSRepair locates damaged portions of the database and reports such damage to the administrator. However, this and other database repair tools do not assist the administrator in determining what specifically has been damaged, and are limited in the extent to which the database is repaired. For example, the DSRepair tool reports the identified errors and simply attempts to make the database replica operational. This may result in the DSRepair tool deleting what it determines to be erroneous data or it could involve repair operations being performed on the data. However, because there is only one replica available to the DSRepair tool, it cannot determine whether the database error has been completely repaired. In addition, if the data is deleted, then the repaired replica will not be consistent with all the other replicas of the distributed database.
The DSRepair tool also generates a text report identifying the implemented remedial action for review by the database administrator. The database administrator must determine, based on this report, whether the repair processes were sufficient to continue normal operation, or whether portions or all of the database need to be replaced. This manual intervention results in the database being unavailable for potentially extended periods of time. Furthermore, the extent to which the database is repaired depends upon the administrator's understanding of the database elements being replaced, so that the integrity of the database is limited by the expertise of the administrator.
Because of the common problem of being unable to accurately identify the errors in the database and the extent to which they have been corrected, an entire replica is often replaced with a presumably valid replica, such as the primary or master replica. However, as noted, there may be inconsistencies among the replicas. In addition, this process is time-consuming, adversely affecting the availability of the database. Furthermore, there is no guarantee that the new database does not have errors located in other portions of the database, or that it will not destroy or damage any new or unreplicated data in the original database.
What is needed, therefore, is a system and method that accurately identifies database errors and repairs them quickly and efficiently, with minimal administrator intervention and with minimal dependence on the administrator's knowledge.