1. Field of the Invention
The invention relates generally to database systems and more particularly to database systems which include a primary database system and a standby database system. As the primary database system executes transactions, it sends data to the standby database system which permits the standby database system to construct a replica of the current state of the database in the primary database system. If the primary database system fails, the standby can immediately take over for it.
2. Description of Related Art
Overview of High-Availability, Low Data Loss Database Systems
Users of database systems have always been concerned with losing data, and consequently, many techniques have been developed for archiving the information stored in a database system and restoring the database system from the archive. As on-line transaction processing has become one of the most important uses of database systems, users have also become concerned with the loss of availability of a database system. A consequence of these concerns has been the development of database systems which include a primary database system and a standby database system. As the primary database system processes transactions, it not only updates its own database, but produces redo data for each transaction. The redo data describes the change made in the primary database system's database as a consequence of the transaction. As the redo data is produced, it is sent to the standby database system. The standby database system thus receives a stream of redo data from the primary database system. The standby database system is initially constructed from a copy of the primary's database. As the standby database system receives the redo data, it first stores it in persistent storage and then applies the redo data to its copy of the primary's database. If the primary database fails or otherwise has to be taken out of service, the standby database system can replace the primary database system almost immediately. All that is required is that the standby database system apply whatever redo data has not yet been applied to the standby database system's database.
Primary and Standby Database Systems Implemented Using Oracle Data Guard Software: FIG. 1
Oracle Corporation, of Redwood City, Calif., manufactures and sells software for making a database system that includes a primary database system and a standby database system. The software is sold under the Oracle® Data Guard name (henceforth “Data Guard”). Data Guard is described in detail in Oracle Data Guard, Concepts and Administration, 10g Release 2 (10.2), Oracle Corporation part number B14239-04, March 2006 (henceforth “Data Guard Concepts”), which is hereby incorporated by reference into the present patent application. Data Guard provides users of the primary database system with three levels of protection against data loss:

Maximum protection

This mode offers the highest level of data protection. User sessions connected to the primary database system perform transactions which change the database in the primary database system. As the primary database system performs transactions, it generates redo data. As the redo is generated, it is stored in a persistent redo log in the primary database system and a copy of the redo data is synchronously transmitted (SYNC) to the standby database system from the primary database system. When the redo data is received in the standby database system, it is stored in a persistent redo log in the standby database system and then applied to the standby database system to keep the database in the standby system identical to the database in the primary system. When the primary database system has completed a transaction, copied the transaction's redo data to its own redo log, and has received an acknowledgment from the standby that the standby has copied that redo data to its redo log, the primary database system sends a commit acknowledgment to the user session indicating that the transaction has been committed in the primary database system.
This mode of operation guarantees no data loss because when the primary database system sends the commit acknowledgment to the user session, both the primary database system and the standby database system have written the redo data for the transaction to persistent storage. When the primary and the standby are operating in the mode just described, they are said to be synchronized. There are two disadvantages of maximum protection mode:

- The primary must wait for an acknowledgment from the standby before it can send the commit acknowledgment, which constrains the rate at which the primary can perform transactions. The period between the time the primary sends the redo data for a transaction and the time it receives the acknowledgment for the redo data from the standby is termed the commit latency.
- If the standby becomes unavailable, so that the primary ceases to receive acknowledgments from the standby, the primary must cease processing transactions. A primary that has ceased processing transactions for this reason is said to have stalled. The standby may become unavailable because the standby has failed or because the communications link between the primary and the standby has failed.

Maximum availability

This mode is similar to the maximum protection mode, including the guarantee of no data loss, at least so long as the primary database system and the standby database system remain synchronized with each other with respect to the redo data that is available to each. However, if the standby database system becomes unavailable, the primary continues to process transactions. When this occurs, the primary has generated redo data that has not been stored in persistent storage in the standby, and the standby and the primary are thus no longer synchronized with each other.
Before the standby and the primary can again be synchronized, the primary must provide to the standby all of the redo data that was generated by the primary while the standby was unavailable. If the primary database system fails before the standby database system is resynchronized with the primary database system, some data may be lost. Thus, maximum protection and maximum availability are similar except that in the former mode the primary stalls if it loses its synchronized standby, while in maximum availability mode the primary continues to generate redo even if the standby has become unavailable.

Maximum performance

This mode offers slightly less data protection to the primary database system, but higher potential performance for the primary than maximum availability mode does. In this mode, as the primary database system processes transactions, the redo data is asynchronously transmitted (ASYNC) to the standby database system. The primary database system sends the commit acknowledgment to the session as soon as it has persisted the redo data for the transaction in the primary's redo log. At a later time, the redo data in the primary's redo log is copied to the standby. Consequently, the speed at which the primary processes transactions is not constrained by the commit latency, and if the standby database system becomes unavailable, processing continues unabated on the primary database system.
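The commit behavior that distinguishes the three modes can be summarized in a short sketch. This is illustrative Python only, not Oracle code; the function and parameter names (`commit`, `persist_local`, `send_to_standby`) are assumptions made for the example.

```python
# Illustrative sketch of commit acknowledgment timing in the three
# Data Guard protection modes described above. Not Oracle code.
from enum import Enum

class Mode(Enum):
    MAX_PROTECTION = "maximum protection"
    MAX_AVAILABILITY = "maximum availability"
    MAX_PERFORMANCE = "maximum performance"

def commit(mode, standby_available, persist_local, send_to_standby):
    """Return (committed, stalled) for one transaction's redo.

    persist_local and send_to_standby are callbacks standing in for the
    primary's ORL write and the SYNC/ASYNC redo transport, respectively.
    """
    persist_local()  # redo always goes to the primary's own redo log first
    if mode is Mode.MAX_PERFORMANCE:
        send_to_standby(wait_for_ack=False)  # ASYNC: ship later, ack now
        return True, False
    if standby_available:
        send_to_standby(wait_for_ack=True)   # SYNC: wait for SRL write ack
        return True, False
    # Standby unreachable: the two SYNC modes diverge.
    if mode is Mode.MAX_PROTECTION:
        return False, True                   # primary stalls; no commit ack
    return True, False  # MAX_AVAILABILITY: commit, now unsynchronized
```

The sketch makes the tradeoff concrete: only maximum protection refuses the commit when the standby is unreachable.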
FIG. 1 is a block diagram of a database system 101 that includes a primary database system 103 and a standby database system 121. The block diagram is a slight modification of FIG. 5-4 of Data Guard Concepts. Primary database system 103's database 107 receives a stream of transactions 105; as the transactions are processed, a stream of redo data is produced which is processed by logwriter (LGWR) process 109. LGWR process 109 sends the redo data both to online redo log files (ORL) 113 and via one or more network server processes (LNSn) 111 and Oracle Network 119 to standby database system 121. In primary database system 103, one or more archiving processes (ARCn) 115 archive the ORL redo data in archived redo log files 117 once the ORL is completed. There is thus a set of ORLs in archived redo log files 117 which contains the redo data belonging to the stream. An ORL is completed when there is no space left in it to hold additional redo, or upon an explicit user request (command) to switch ORLs. The ORL that a LGWR is writing into is also referred to as the current ORL. An ORL remains current until it is completed.
In standby database system 121, the stream of redo data being written to the current primary ORL also goes to remote file server process (RFS) 123, which writes the redo data to standby redo log files (SRL) 125. After standby database system 121 has written an item of redo data to current SRL 125, it sends an acknowledgment 124 of the write to primary database system 103. SRLs work generally like ORLs; thus there is a current SRL corresponding to the current ORL to which the redo data currently being received from the primary is written. When the current ORL is completed, the primary sends a “completed” packet to the standby and when the RFS 123 receives the “completed” packet, it completes the current SRL. The arrival of a “completed” packet in the standby is one example of an end of redo event. An end of redo event is an event which indicates that the primary is no longer sending redo data to be written to the current SRL. Among the end of redo events are the receipt of a “completed” packet as just described and a failover command. The failover command indicates that the primary database system has become absent and that the standby database system is to become the new primary database system.
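The receive-then-acknowledge discipline of RFS 123 just described can be sketched as follows. This is an illustrative model, not Oracle's implementation; the message shapes and callback names are assumptions made for the example.

```python
# Illustrative sketch of the RFS loop described above: each redo item is
# persisted to the current SRL before acknowledgment 124 is returned, and
# an end of redo event ("completed" packet or failover command) closes the
# current SRL. Message and callback names are assumed, not Oracle's.
def rfs_handle(msg, srl_write, send_ack, complete_srl):
    if msg["kind"] == "redo":
        srl_write(msg["data"])   # persist to the current SRL first...
        send_ack(msg["seq"])     # ...then acknowledge to the primary
        return "receiving"
    if msg["kind"] in ("completed", "failover"):
        complete_srl()           # end of redo event completes the current SRL
        return "completed"
    raise ValueError(f"unknown message kind: {msg['kind']}")
```

The ordering inside the `"redo"` branch is the point: the acknowledgment is never sent before the SRL write is durable.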
The completed SRL contains all of the redo data that was written to the completed ORL. When the current SRL is completed, an archiving process ARCn 127 in the standby archives it to archived redo log files 129. Thus, as in the primary, there is a set of SRLs that contains the redo data belonging to the stream. A real time apply process 131 in the standby applies the redo data in either a completed or current SRL 125 to standby database system 121's database. If the real time apply process falls behind in its redo application, and if the SRL has been archived, the real time apply process will apply the redo from the archived log instead. If the standby's database is a physically exact copy of database 107, the redo data may be applied in the form in which it is received from LGWR 109 (MRP); if the standby's database is logically equivalent to database 107 (i.e., the effect of executing a given SQL statement on the two databases is the same), the redo data is translated into equivalent SQL statements (LSP). The SQL statements are then executed in the standby's database.
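The two apply paths and the fallback to the archived log can be sketched in a few lines. The function and parameter names here are invented for illustration; they do not correspond to Oracle's internal interfaces.

```python
# Illustrative sketch of the apply logic described above. A physical
# standby applies redo records directly; a logical standby translates
# them into equivalent SQL. Names are assumptions, not Oracle's API.
def apply_one(chunk, physical, apply_block, run_sql, translate_to_sql):
    if physical:
        apply_block(chunk)              # database is a physically exact copy
    else:
        for stmt in translate_to_sql(chunk):
            run_sql(stmt)               # database is logically equivalent

def pick_redo_source(in_srl, srl_archived):
    # Real time apply prefers the SRL; if it has fallen behind and the SRL
    # has already been archived, it reads from the archived redo log.
    if in_srl:
        return "srl"
    return "archive" if srl_archived else "wait"
```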
As shown in FIG. 1, the transfer of redo data from primary database system 103 to standby database system 121 is synchronous: when LGWR 109 transmits an item of redo data to standby database system 121, it waits until it has received acknowledgment 124 from standby database system 121 before providing commit indication 110 to primary database system 103; only after LGWR 109 has so indicated does the user transaction that generated the redo get a commit acknowledgment. As previously described, the speed at which the primary database 103 can process transactions depends on the commit latency. The commit latency in turn depends on the latency of acknowledgment 124. That latency has three parts: two network latencies and a write latency. The two network latencies are the length of time it takes for the redo data to be transferred from LGWR 109 to RFS 123 and the length of time it takes for the acknowledgment to be transferred from RFS 123 to LGWR 109. The write latency is the length of time it takes RFS 123 to write the item of redo data to SRL 125. As network speeds have increased, the write latency has become the major component of the latency of acknowledgment 124. The latency of acknowledgment 124 is in turn the major component of the commit latency of primary database system 103.
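The decomposition of the latency of acknowledgment 124 can be written as a simple additive model. Treating the three components as simply additive is an assumption made for illustration; real latencies also include queuing and processing overheads.

```python
# Back-of-the-envelope model of the acknowledgment latency described
# above: two network legs plus the SRL write. Additivity is assumed.
def ack_latency_ms(net_primary_to_standby, srl_write, net_standby_to_primary):
    """Latency of acknowledgment 124, in milliseconds."""
    return net_primary_to_standby + srl_write + net_standby_to_primary
```

For example, with sub-millisecond network legs and a several-millisecond disk write, the SRL write dominates the total, which is the observation the text makes about fast networks.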
Primary or Standby Database Systems that are RAC Database Systems: FIGS. 2 and 3
Oracle Corporation has developed a technique for implementing database systems that is termed real application clusters or RAC. A RAC database system is one in which a number of database servers are used to implement a single database system. RAC database systems are described in detail in Oracle® Database, Oracle Clusterware and Oracle Real Application Clusters, Installation Guide, 10g Release 2 (10.2) for Linux, Oracle part number B14203-05, December 2005. That publication is hereby incorporated by reference into the present patent application.
FIG. 2 is a block diagram 201 of a RAC database system 202 which is coupled via Internet 205 to a database administrator client 203 and a number of user clients 207(0 . . . n). RAC database system 202 includes a number of database servers 209(0 . . . k), each of which is connected to Internet 205. Each of servers 209 is termed an instance of RAC database system 202. The database servers 209 are coupled to each other by fast interconnect 211. Fast interconnect 211 makes it possible to store data that must be accessible to all database servers 209 in a shared cache 213. A second fast interconnect connects database servers 209 to a set of shared disk drives 215(0 . . . n). These disk drives contain the actual data for the database system. To user clients 207, RAC database system 202 appears to be a single database system.
Data Guard permits RAC database systems to be used either as primary or standby database systems. FIG. 3 is a slightly modified version of FIG. D-2 from Data Guard Concepts. It shows an implementation 301 of Data Guard in which primary database 303 is implemented by two primary RAC instances 209(A) and (B) and standby database 305 is implemented by two standby RAC instances 209(C) and (D). The primary and standby are connected as before by Oracle Network 119. In FIG. 3, the numbers within circles indicate local connections, and the numbers within boxes indicate remote connections. In a Real Application Clusters environment, logwriter processes 109 for all of the primary instances write redo data to a set of online redo log files 113 which is accessible to all of the primary instances. Both ORLs at the primary and SRLs at the standby are configured on the set of shared disk drives. Thus ORLs are accessible from all primary instances; similarly, SRLs are accessible from all standby instances. The stream of redo data from each of the primary instances belongs to a separate redo thread for that instance.
Each ORL is associated with and tightly coupled to a particular thread; that is, threads cannot reuse each other's assigned ORLs for writing their redo. There are at least two ORLs for each thread at the primary. This allows the LGWR for a RAC database server to switch into a new ORL (the new ORL becomes current) while the previous, completed ORL is archived.
Similar to the ORL's association with a particular thread, each SRL is tightly coupled to a particular primary thread; i.e., if an SRL has been used to receive redo for a particular primary thread, that SRL cannot be reused to receive redo for another primary thread. Data Guard recommends having one more SRL per thread than the number of ORLs that are configured for the thread. Thus, if there are 3 ORLs configured for a particular thread, there should be 4 SRLs configured for that thread at the standby.
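The sizing recommendation above amounts to a one-line rule, shown here for concreteness; the function name is invented for the example.

```python
# The Data Guard SRL sizing recommendation described above: one more SRL
# per thread than the thread's configured ORLs. Function name is assumed.
def recommended_srls_per_thread(orls_per_thread):
    return orls_per_thread + 1
```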
LGWR 109 for each of the instances is aware of whether all of the other LGWRs 109 are receiving acknowledgments 124, as indicated by arrow 307. If any of the LGWRs 109 is not receiving acknowledgments 124, standby 305 is no longer synchronized with primary 303. More precisely, a RAC primary database system operating in maximum availability mode is synchronized when all LGWRs for all the RAC database servers that are up and running have connectivity to the standby database and there is no gap in received redo for any of the threads in the RAC database.
When synchronized, if one RAC primary database server loses connectivity to the standby database, the LGWR on that server messages all the other LGWRs to drop their connections to the standby and stop shipping redo to the standby. When this occurs, the SRLs stop receiving redo from the primary RAC database servers, and the primary database becomes unsynchronized. At a later point in time, once all LGWRs have connectivity to the standby, and all gaps in the redo data have been resolved, the primary database again becomes synchronized.
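The synchronization rule just described can be expressed as a small predicate plus a broadcast step. This is an illustrative model under assumed data structures, not Oracle's implementation.

```python
# Illustrative model of RAC synchronization as described above: the primary
# is synchronized only while every running instance's LGWR has standby
# connectivity and no thread has a redo gap. Data shapes are assumptions.
def rac_synchronized(lgwr_connected, thread_has_gap):
    """lgwr_connected: bool per running instance; thread_has_gap: bool per thread."""
    return all(lgwr_connected) and not any(thread_has_gap)

def on_connectivity_loss(instances):
    # The LGWR that lost its connection messages all the other LGWRs,
    # which drop their connections and stop shipping redo to the standby.
    for inst in instances:
        inst["shipping_redo"] = False
    return instances
```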
Any standby instance can receive redo data from any primary instance; a standby instance that is receiving data from a primary is a receiving instance 209(C). All receiving instances write to a set of standby redo log files 125 that are accessible to each of the receiving instances (as explained above, SRLs are configured on the shared disk drives). However, the archived redo log files 129 must ultimately reside on disk devices accessible by the recovery instance 209(D). Transferring the standby database archived redo log files from the receiving instance to the recovery instance is achieved using the cross-instance archival operation. The standby database cross-instance archival operation requires use of standby redo log files 125 that are accessible to all of the standby database instances as the temporary repository of primary database archived redo log files. Using standby redo log files 125 not only improves standby database performance and reliability and allows for implementation of the Maximum Availability mode, but also allows the cross-instance archival operation to be performed on clusters that do not have a cluster file system. Note that a database administrator can also configure the location for archived redo logs (regardless of the particular standby instance from which they were archived) to be visible, or readable, from all instances. This can be done by archiving the completed SRLs on another shared disk or shared file system that is shared by all the standby instances, i.e., the file system path to a particular archived log is the same, and is accessible, from all instances. In such configurations, cross-instance archival operations from one standby instance to another are not required. A similar arrangement is possible at the primary, so that the primary instance from which archived logs are backed up to tape (long-term storage) can access archived logs generated at any primary instance.
Again, in such cases, cross-instance archival operations are not necessary at a primary that is a RAC database.
Details of Standby Redo Log 125: FIG. 4
An Oracle database system includes a system global area (SGA) which contains information that is global to the entire database system. In the case of a RAC database system, the SGA is stored in shared cache 213. Included in the SGA is information about the redo log files. Shown at 411 is a database system-provided view V$STANDBY_LOG 411 which contains the information maintained in the database system about standby redo log files 125. Each standby redo log file has a group number, a thread number indicating the redo thread that the file belongs to, a sequence number which indicates the file's position in the set of redo files for the thread, the number of bytes in the file, the number of those bytes that are actually being used, whether the file has been written to an archived redo log file 129, the status of the file, the lowest system change number (SCN) in the file and the time stamp for that SCN, and the highest SCN in the file and the time stamp for that SCN. All of this information except the status information and the archived information will be the same in the completed standby redo log file and the completed on-line redo log file 113 it corresponds to.
The packets that contain the redo data include the group number, thread number, and sequence number of the on-line redo log file 113 the standby redo log file 125 corresponds to. The system change numbers are also contained in the packets. A system change number is a monotonically increasing value which identifies the redo data for a given change in the database system. Except when more than one thread is changing the same set of database records concurrently, SCNs are issued independently for each thread; that is, the SCNs for each thread increase monotonically, but there is no relationship between an SCN in one thread and an SCN in another.
If two RAC threads update the same data block (database records), the redo that describes the changes to the data block has to be ordered by SCNs. It is critical from a correctness perspective that the first thread that modifies the data block, say Thread X, generates redo at an SCN A that is strictly smaller than SCN B (i.e., A&lt;B), where B is the SCN for the redo generated for the same data block in the other thread, say Thread Y. Redo generated by Thread Y in this example depends on Thread X's redo. The LGWR for Thread X has to commit the redo generated by Thread X before the LGWR for Thread Y can commit its redo for the data block. Effectively, the RAC environment ensures that the LGWR commit code paths are serialized if there is dependent redo as described here.
With dependent redo, it is very important that the SCNs be properly ordered in the redo data. If the database were to crash and crash recovery were needed, then with respect to the above example it is critical that the redo generated by Thread X be applied before the redo generated by Thread Y. Changing the order of redo application (applying the redo at SCN B before the redo at SCN A) would cause the database to become inconsistent, i.e., the database would contain changes to tables that never existed in the database in the past.
When a primary is operating in Maximum Availability, is using the SYNC transport, and is synchronized with the standby, Data Guard guarantees that the ordering of LGWR commits for dependent redo at the primary ORL's is maintained for the corresponding write to SRL's as well. That is, redo is written to the SRL for thread X at the standby before it is written to the SRL for thread Y at the standby.
If two threads at the primary generate redo for different data blocks, there is no particular relationship between the SCNs recorded in the redo; the SCNs can be the same or different.
MRP/LSP (apply processes) cannot distinguish by looking at the SCNs in redo from various threads whether the redo is dependent or not. The apply process employs a simple algorithm that always applies all redo at SCN X from all threads before applying redo at SCNs&gt;X. SCNs are monotonically increasing, and it is possible to have SCN gaps within the same thread.
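The apply rule just stated is an SCN-ordered merge across the per-thread redo streams. The following sketch assumes redo is modeled as (scn, thread, change) tuples; that representation is invented for the example. Applying everything at a given SCN from all threads before anything at a higher SCN trivially preserves the ordering of dependent redo, whose SCNs are strictly increasing.

```python
# Sketch of the apply process's ordering rule described above: merge the
# per-thread redo streams by SCN so that all redo at SCN X, from every
# thread, precedes any redo at SCNs > X. Redo records are modeled as
# (scn, thread, change) tuples; this is an assumption for illustration.
import heapq

def merge_threads_by_scn(threads):
    """threads: list of per-thread redo lists, each already SCN-ascending."""
    return list(heapq.merge(*threads, key=lambda rec: rec[0]))
```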
The redo data in an ORL or SRL is stored in a log file. The particular form of log file disclosed herein is termed in the following a logfile. At 401 in FIG. 4 may be seen a diagram of a logfile in an SRL 125. Each logfile 401 has a header 403 which includes the information from the V$STANDBY_LOG view 411 for the logfile. Also included is a no data lost (NDL) flag 405. When an ORL is completed, LGWR process 109 indicates to the standby whether the standby is synchronized. If it is, RFS process 123 sets the NDL flag in the new current SRL that corresponds to the new ORL. The flag thus indicates that there are no previous gaps in the stream of redo data to which the redo data being written to the SRL's logfile belongs. If the standby and the primary lose synchronization, RFS process 123 resets the NDL flag in the current SRL. An NDL flag in the current SRL which is reset, either because it was reset when the SRL was created or because of a loss of synchronization while the current SRL was receiving redo data, is not set again in the current SRL until the gap in the stream of redo data has been eliminated and the primary and standby are again synchronized. When an end of redo event occurs, the NDL flag in the current SRL remains set as it was at the time of the event. The remainder of the logfile consists of chunks of redo data 409, each one including the SCN 407 that is associated with the change specified by the chunk of redo data.
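The lifecycle of the NDL flag can be modeled as a tiny state machine. The class and method names below are invented for illustration; they do not correspond to Oracle's internal structures.

```python
# Illustrative model of the NDL flag rules described above. Names are
# assumptions; the semantics follow the text: set at log switch only if
# the primary is synchronized, reset on loss of synchronization, and set
# again only once the redo gap is eliminated and sync is restored.
class CurrentSrl:
    def __init__(self, synchronized_at_switch):
        # RFS sets the flag in the new current SRL only if the primary
        # reported that the standby was synchronized at the log switch.
        self.ndl = synchronized_at_switch

    def on_synchronization_lost(self):
        self.ndl = False

    def on_resynchronized(self):
        # All gaps in the redo stream have been eliminated and the primary
        # and standby are synchronized again, so the flag may be set.
        self.ndl = True
```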
When a primary database fails, and a failover to the standby occurs, the process of applying the received redo data to the standby's database prior to making the standby the new primary database is referred to as Terminal Apply. During terminal apply, a decision needs to be made whether all of the redo from all of the threads that had been received in the standby at the time of the primary's failure can be applied. Note that the last bit of redo received from each thread at the time of the failure can be at various SCNs. How do we know it is safe to apply all the redo from all the threads (i.e., how do we know we are not dealing with dependent redo)?
This is where the NDL flag is used: if all the SRLs that are to be applied to the standby have their NDL flag set, then the operational steps described earlier for Maximum Availability guarantee that all of the redo in the SRLs can be safely applied to the standby database during the failover. We refer to such a Terminal Apply process (where all redo from all threads can be applied during failover) as Complete Terminal Recovery.
However, if the NDL flag is not set in one or more SRLs that are to be applied to the standby, then the apply process during failover cannot proceed to apply redo past the highest SCN that is common across all threads. In this case, some data is lost during failover, and such a terminal apply process is referred to as Incomplete Terminal Recovery.
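The failover decision just described can be sketched as a small function. The data layout is an assumption made for illustration: each thread is summarized by its last SRL's NDL flag and highest received SCN, and the "highest SCN common across all threads" is modeled as the minimum of the per-thread maxima (which assumes no gaps below each thread's maximum).

```python
# Sketch of the Terminal Apply decision described above. Each thread is
# summarized as {"ndl": bool, "max_scn": int}; this layout is assumed.
def terminal_recovery_plan(threads):
    if all(t["ndl"] for t in threads):
        # Complete Terminal Recovery: everything received can be applied.
        return "complete", max(t["max_scn"] for t in threads)
    # Incomplete Terminal Recovery: stop at the highest SCN common to all
    # threads; redo beyond it may be dependent and cannot be trusted.
    return "incomplete", min(t["max_scn"] for t in threads)
```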
Problems of Prior Art Database Systems with Primary and Standby Database Systems
The advantage of the prior art database systems with primary and standby database systems is that because the primary receives an acknowledgment 124 from the standby only after an item of redo data has been written to the current SRL, when an end of redo event occurs and the current SRL has its NDL flag set, two things are true: there are no gaps in the stream of redo data either in or prior to the current SRL, and the primary received acknowledgments for all of the redo data contained in the current SRL. Consequently, simultaneous failure of the primary and the standby cannot result in loss of data for which a commit acknowledgment was sent to the user's session. The disadvantage is the commit latency in the primary that results from the requirement that the standby wait until the redo data has been written to SRL 125 before sending acknowledgment 124 to the primary.
There are many circumstances in which the prior art's tradeoff between data protection and commit latency is not optimum. Simultaneous failure of the primary and standby is extremely unlikely unless the primary and standby share a single location or a single power supply. In most systems having primary and standby database systems, the primary and standby are at widely separated physical locations and have independent power supplies. What is needed is a way of obtaining a more optimal tradeoff between data protection and commit latency in systems which have a low probability of simultaneous failure of the primary and standby.