1. Field of the Invention
The present invention relates generally to database management systems and more particularly to techniques for performing automatic failover from a primary database server to a standby database server.
2. Description of Related Art
As government and business store increasing amounts of data in database systems, there are increasing demands to have such data always available, even in the face of catastrophic failure of computer hardware, network outage, disastrous data corruption, etc. To meet these requirements, database system engineers have developed database system configurations in which database data is replicated in more than one database system. Once data is replicated from one database system to another, if the first database system becomes absent from the configuration, the second database system is used for processing database requests. The term absent is used here for any situation in which other participants in a configuration lose contact with a particular participant. Absence may be caused, for example, by failure of the absent participant or by failure of communications links between the absent participant and other participants. The process of switching from an absent first database system to a second database system is commonly known as failover.
Replicating a Database in a Standby Database
Replication features such as those just described are available under the name Oracle Data Guard in relational database systems manufactured by Oracle Corporation of Redwood City, Calif.
FIG. 1 shows a database system that uses Data Guard to replicate data to multiple standby databases across a network. Replicated database system 101 contains primary database 103 and two standby databases 113 and 121. Primary database 103 contains database information including database tables and metadata. Updates made to the primary database 103 are transmitted via network 105 to replication system 108, which replicates the updates in database 113 and/or to replication system 110, which replicates the updates in database 121. In both replication systems, what is transmitted via network 105 is updates in the form of redo data 107. The redo data is then stored in redo log files 109. Redo log files 109 are files that contain redo data records. Redo data records record data that the database system can use to reconstruct all changes made to the primary database 103, including changes that have not yet been committed (made permanent). For example, if a balance value in a bank_balance table changes, the database system generates a redo data record containing a change vector that describes the change to the database. When the redo data is used to recover the database system, the database system reads the change vectors in the redo data records and applies the changes recorded in the vectors to the database.
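The replay of redo data described above can be illustrated with a minimal sketch. It is not Data Guard's actual record format: the `ChangeVector` and `RedoRecord` types, the `seq` ordering field, and the dictionary used as a stand-in database are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeVector:
    """Hypothetical description of one change to one column of one row."""
    table: str
    key: str
    column: str
    new_value: object


@dataclass(frozen=True)
class RedoRecord:
    """Hypothetical redo data record: an ordering number plus change vectors."""
    seq: int
    vectors: tuple


def apply_redo(database, redo_log):
    """Replay redo records in order, applying every change vector."""
    for record in sorted(redo_log, key=lambda r: r.seq):
        for v in record.vectors:
            row = database.setdefault(v.table, {}).setdefault(v.key, {})
            row[v.column] = v.new_value
    return database


# Two successive changes to a balance in the bank_balance table.
redo_log = [
    RedoRecord(1, (ChangeVector("bank_balance", "acct-1", "balance", 100),)),
    RedoRecord(2, (ChangeVector("bank_balance", "acct-1", "balance", 75),)),
]
standby = apply_redo({}, redo_log)
```

Because the records are replayed in order, the stand-in database ends with the most recent value for the balance.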
The redo data may be applied either physically or logically against a standby database. Redo data is a physical copy of the data produced in primary database 103 as a result of the change. When redo data is applied physically against a standby database, as shown at 111 and 113, standby database 113 is physically identical to primary database 103, that is, it has data structures which are identical on a disk block by disk block basis to those in primary database 103 and the redo data is applied as it comes from primary database 103 to database 113.
When redo data is applied logically against a standby database, as shown at 115-121, standby database 121 is logically identical to primary database 103, that is, an SQL statement will have the same result when applied either to primary database 103 or logical standby database 121. When redo data is applied logically, the redo data is transformed into the SQL statements that produced the changes recorded in the redo data, as shown at 115 and 117, and the SQL statements are then executed on logical standby database 121, as shown at 119.
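Logical application can be sketched as follows, using Python's standard `sqlite3` module as a stand-in for both databases. The `change` dictionary and the `to_sql` helper are hypothetical; the actual transformation of redo data into SQL is considerably more involved than this illustration suggests.

```python
import sqlite3


def make_db():
    """Create an in-memory database holding one bank_balance row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bank_balance (acct TEXT PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO bank_balance VALUES ('acct-1', 100)")
    return db


def to_sql(change):
    """Turn a hypothetical change description into a parameterized UPDATE."""
    sql = (
        f"UPDATE {change['table']} SET {change['column']} = ? "
        f"WHERE {change['key_column']} = ?"
    )
    return sql, (change["new_value"], change["key"])


primary, logical_standby = make_db(), make_db()

# The change is made on the primary ...
primary.execute("UPDATE bank_balance SET balance = 75 WHERE acct = 'acct-1'")

# ... and reaches the standby as a description that is turned back into SQL.
change = {
    "table": "bank_balance",
    "column": "balance",
    "key_column": "acct",
    "key": "acct-1",
    "new_value": 75,
}
sql, params = to_sql(change)
logical_standby.execute(sql, params)

# The same query now gives the same result on both databases.
query = "SELECT balance FROM bank_balance WHERE acct = 'acct-1'"
primary_result = primary.execute(query).fetchall()
standby_result = logical_standby.execute(query).fetchall()
```

The two databases need not be identical block by block; it is enough that a query returns the same result on each.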
An Oracle database system 101 using Data Guard can be run in three distinct protection modes:
Maximum Protection: This mode offers the highest level of data protection. Redo data 107 is synchronously transmitted (SYNC) to standby database system 108 or 110 from the primary database 103, and transactions are not committed on primary database 103 until the standby database indicates to the primary database that it has the redo data. When no standby database can do this, the primary database must stop processing. As long as the primary database system is processing data in maximum protection mode, there will be no loss of redo data.
Maximum Availability: This also guarantees no loss of redo data at least so long as primary database 103 and standby database 113 or 121 remain synchronized with each other with respect to the redo data that is available to each. However, if standby database system 108 or 110 becomes absent, processing continues on primary database 103. Thus the primary and that standby are no longer synchronized with each other; the primary has generated redo data that is not yet available to the standby. When the fault is corrected, standby database 113 or 121 is resynchronized with primary database 103. If a failover occurs before the standby database is resynchronized with the primary database, some data may be lost.
Maximum Performance: This mode offers slightly less data protection to primary database 103, but higher potential performance for the primary than does the maximum availability mode. In this mode, as primary database 103 processes transactions, redo data 107 is asynchronously transmitted (ASYNC) to standby database system 108 or 110. The commit operation on primary database 103 does not wait for standby database system 108 or 110 to acknowledge receipt of redo data 107 before completing write operations on primary database 103. If any standby destination 113 or 121 becomes absent, processing continues unabated on primary database 103. There is little impact on primary database 103 performance due either to the overhead of asynchronously transmitting redo data or to the loss of the standby.
Automatic Failover
If the primary database system and the standby database system are synchronized with each other and the primary database system becomes absent, an automatic failover may occur. In the automatic failover, the standby database becomes the primary database and when the former primary database has recovered, the former primary may become the new standby. FIG. 2 presents a schematic overview of how automatic failover works.
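Whether the standby is synchronized at the moment the primary becomes absent depends on how commits were handled under the protection mode in force. The following minimal sketch models the outcome of a single commit on the primary in each of the three modes. The `Mode` and `Outcome` names and the `commit_outcome` function are hypothetical and do not correspond to any Data Guard interface.

```python
from enum import Enum


class Mode(Enum):
    MAX_PROTECTION = "maximum protection"
    MAX_AVAILABILITY = "maximum availability"
    MAX_PERFORMANCE = "maximum performance"


class Outcome(Enum):
    COMMITTED_ACKED = "committed after the standby acknowledged the redo"
    COMMITTED_UNACKED = "committed without waiting for an acknowledgement"
    STOPPED = "primary stopped processing"


def commit_outcome(mode, standby_reachable):
    """Model what happens to one commit on the primary (illustrative only)."""
    if mode is Mode.MAX_PERFORMANCE:
        # ASYNC transport: the commit never waits for the standby.
        return Outcome.COMMITTED_UNACKED
    if standby_reachable:
        # SYNC transport: the commit completes once the standby has the redo.
        return Outcome.COMMITTED_ACKED
    if mode is Mode.MAX_PROTECTION:
        # No standby can acknowledge, so the primary must stop processing.
        return Outcome.STOPPED
    # Maximum availability: processing continues, unsynchronized.
    return Outcome.COMMITTED_UNACKED


for mode in Mode:
    for reachable in (True, False):
        print(f"{mode.value:22} standby reachable={reachable!s:5} -> "
              f"{commit_outcome(mode, reachable).name}")
```

Only commits that end in `COMMITTED_ACKED` are known to be present on the standby, which is why a failover can be free of data loss only while the two databases remain synchronized.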
An exemplary implementation of a database system employing automatic failover was disclosed by Microsoft Corporation in 2004. The following schematic is based on that implementation. A normally functioning replicated database system is shown at 203. The replicated database system 203 has a primary database 202 and standby database 211. In the Microsoft Corporation implementation, both the primary and standby databases run on SQL servers. Additionally, the replicated database system includes a witness or observer 209. The witness or observer and the two database systems are in contact with and monitor each other, as shown by arrow 213. In the Microsoft Corporation system, the witness or observer is another SQL server; the server need not, however, have a database system mounted on it. In the following, the primary database system, standby database system, and the witness or observer are termed the failover participants.
The function of the witness or observer (in the following simply “Observer”) in the Microsoft Corporation implementation of automatic failover and in such implementations generally is to provide an entity in addition to the primary and standby databases which can help the standby or primary database determine either whether a role change has already occurred or whether a role change is now necessary. For example, both the standby and the Observer monitor the primary database system, and if the primary database system becomes absent, the standby database system may have to perform an automatic failover. The standby database system will not, however, perform the automatic failover unless the Observer has confirmed that the primary is absent. In general terms, the process of one participant in the configuration obtaining confirmation from another participant in the configuration before changing the current state of the configuration is termed obtaining a quorum for the state change. Thus, in general terms, the function of the observer is to make it possible for either the primary or the standby to obtain a quorum for a state change when the other is not available.
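The quorum rule for failover can be sketched as a single decision made by the standby. The function and parameter names below are hypothetical, and the sketch assumes that each participant knows only which of the other participants it can currently reach.

```python
def standby_may_fail_over(standby_sees_primary,
                          standby_sees_observer,
                          observer_sees_primary):
    """Decide whether the standby has a quorum for automatic failover."""
    if standby_sees_primary:
        # The primary is not absent from the standby's point of view.
        return False
    if not standby_sees_observer:
        # Nobody is available to confirm the absence, so there is no quorum.
        return False
    # The Observer must confirm that it has lost the primary as well.
    return not observer_sees_primary


# The primary has failed: both the standby and the Observer have lost it.
primary_failed = standby_may_fail_over(False, True, False)

# Only the link between primary and standby is down: the Observer still
# sees the primary, so the standby does not fail over.
link_down_only = standby_may_fail_over(False, True, True)

# The standby is isolated from both other participants: no quorum.
standby_isolated = standby_may_fail_over(False, False, False)
```

Requiring the Observer's confirmation keeps a standby that has merely lost its own network link from promoting itself while the primary is still running.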
When the replicated database system is functioning as shown at 203, primary database 202 is forwarding redo data 215 to redo log files 109 and the redo data is being applied to standby database 211 (arrow 215). During normal functioning of the replicated database as shown at 203, primary database 202 fails. At 205 is shown how the replicated system fails over from failed primary database 202 to standby or failover target database 211. Because database 202 has failed such that Observer 209 no longer is in communication with database 202, Observer 209 is in communication only with database 211, as shown by arrow 217, and database 202 has ceased sending database 211 redo data. If Observer 209 has also noted that database 202 has failed, there is a quorum for automatic failover and standby database 211 can perform the failover. Upon failover, applications that would be attached to failed primary database 202 are re-attached to the new primary database 211 instead. Modifications to the new primary database 211 are stored in redo log files in the usual fashion. At 207 is shown what happens when Observer 209 notes that database 202 has become available again. Observer 209 now has communication with both database systems, as shown by arrow 213(iii). Working together, new primary server 211 and Observer 209 recover failed primary 202 such that it may serve the new primary as its standby server. At this point, database 211 is the primary database and database 202 the standby database. Redo data 219 flows from database 211 to database 202, as shown by arrow 219.
A serious concern in the design of database systems that do automatic failover is ensuring that the automatic failover does not result in divergence between the primary and standby databases. The databases have diverged when there are differences between the databases which cannot be reconciled without the loss of data in one or the other of the databases. There are two situations in which failover may result in diverging databases:
    1. At the time of the failover, some of the redo data generated by the absent primary prior to its absence has not reached the standby; or
    2. the failover has caused the former standby to become the primary and the absent primary does not realize that the failover has occurred and again begins to generate redo data. This situation, in which two primary database systems are generating different streams of redo data, is termed the split brain syndrome.
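One simple way to model divergence, offered only as an illustration, is to treat each database's history as an ordered stream of redo records. Under that assumption, two streams can still be reconciled when one is a prefix of the other, because the shorter one merely has to catch up; they have diverged when they disagree somewhere within their common length. The `diverged` function and the record labels are hypothetical.

```python
def diverged(stream_a, stream_b):
    """Return True when neither redo stream is a prefix of the other."""
    shared = min(len(stream_a), len(stream_b))
    return stream_a[:shared] != stream_b[:shared]


# The standby is merely behind the primary: it can catch up without loss.
lagging = diverged(["r1", "r2", "r3"], ["r1", "r2"])

# Situation 1: r3 never reached the standby, which then became the primary
# and generated its own redo record s1.
lost_redo = diverged(["r1", "r2", "r3"], ["r1", "r2", "s1"])

# Situation 2 (split brain): after the failover the old primary resumes and
# generates r4 while the new primary generates s1.
split_brain = diverged(["r1", "r2", "r3", "r4"], ["r1", "r2", "r3", "s1"])
```

In both divergent cases, making the two databases agree again means discarding at least one record from one of the streams.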
In the Microsoft automatic failover system of FIG. 2, divergence resulting from automatic failover is prevented by having the primary cease processing transactions whenever no quorum is available, i.e., whenever both the standby and the witness are unavailable. The primary ceases processing transactions even though it is perfectly capable of continuing to process them, albeit at the risk of some loss of redo data because the redo being produced by the primary cannot be immediately sent to the standby. In the following, a primary which ceases processing transactions in order to prevent divergence is said to have stalled. As can be seen from the foregoing, there is a tradeoff in systems with automatic failover between divergence prevention and availability of the primary database system.
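The stalling rule described above reduces to a check that the primary makes before processing transactions. The sketch below uses hypothetical names and models only the rule as stated here, namely that the primary stalls when it can reach neither the standby nor the witness.

```python
def primary_must_stall(standby_reachable, witness_reachable):
    """The primary stalls when no quorum partner is reachable."""
    return not standby_reachable and not witness_reachable


# Enumerate all four reachability combinations.
for standby_ok in (True, False):
    for witness_ok in (True, False):
        action = "stall" if primary_must_stall(standby_ok, witness_ok) else "process"
        print(f"standby reachable={standby_ok!s:5} "
              f"witness reachable={witness_ok!s:5} -> {action}")
```

Only one of the four combinations stalls the primary, yet in that case the primary stops even though it is itself healthy. This is the cost in availability that is paid for preventing divergence.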
While automatic failover is conceptually simple, there are many difficulties at the detailed design level. Among them are:
    designing a system with automatic failover such that divergence is prevented and availability of the primary is maximized;
    managing automatic failover so that divergence cannot occur;
    managing state changes generally in the system so that divergence cannot occur;
    minimizing the resources required for the observer; and
    propagating the current configuration state among the failover participants.
It is an object of the invention disclosed herein to provide solutions for these and other problems in the design of replicating database systems that perform automatic failover.