A “cluster” is made up of multiple “nodes,” each of which executes one or more database server instances that read data from and write data to a database that is located on shared storage. Each node may be a separate computing device. Nodes may communicate with other nodes and the shared storage through a network and/or other communication mechanisms.
Clusters offer many benefits not available in alternative data processing configurations. The computational power offered by a cluster of relatively inexpensive nodes often rivals the computational power offered by a single, much more expensive, computing device. Individual nodes can be added to or removed from a cluster according to need. Thus, clusters are highly scalable. Even when one node in a cluster fails, other nodes in the cluster may continue to provide services. Thus, clusters are highly fault-tolerant.
As mentioned above, each node in a cluster may execute one or more database server instances, referred to herein simply as “instances.” Each such instance may have a separate buffer cache stored in the memory of the node on which that instance is resident. When a particular instance needs to access a block of data (referred to hereinafter as a “block”) from the database, the instance determines whether the block is stored in any instance's buffer cache. If the block is stored in some instance's buffer cache, then the particular instance obtains the block from that buffer cache and places the block in the particular instance's buffer cache, unless the block is already stored in the particular instance's buffer cache. If the block is not stored in any instance's buffer cache, then the particular instance reads the block from the database and places the block in the particular instance's buffer cache. Either way, the particular instance can then access the block from the particular instance's buffer cache instead of the database. Accessing a block from a buffer cache is significantly faster than accessing a block from the database.
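The lookup order described above can be illustrated with a minimal sketch. The names `Instance`, `get_block`, and the dictionary standing in for the database are hypothetical and serve only as an illustration; an actual cluster would locate remote copies through its own inter-node protocol rather than by scanning every instance.

```python
from typing import Dict, List, Tuple


class Instance:
    """Hypothetical database server instance with its own buffer cache."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.buffer_cache: Dict[int, bytes] = {}


def get_block(requester: Instance, block_id: int,
              instances: List[Instance],
              database: Dict[int, bytes]) -> Tuple[bytes, str]:
    """Return the block and a label naming where it was found."""
    # Case 1: the block is already in the requesting instance's buffer cache.
    if block_id in requester.buffer_cache:
        return requester.buffer_cache[block_id], "local cache"

    # Case 2: some other instance's buffer cache holds the block; copy it over.
    for other in instances:
        if other is not requester and block_id in other.buffer_cache:
            block = other.buffer_cache[block_id]
            requester.buffer_cache[block_id] = block
            return block, "remote cache"

    # Case 3: no buffer cache holds the block; read it from the database.
    block = database[block_id]
    requester.buffer_cache[block_id] = block
    return block, "database"
```

In each case the block ends up in the requesting instance's buffer cache, so later accesses by that instance take the first path.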
A block's size is typically fixed, e.g., 8 KB. The size of a block may be based on the properties of the disk on which data is stored and/or the mechanism that is used to read data from and write data to the disk.
When an instance accesses a block, the instance may do so for the purpose of modifying the block. The instance modifies the block that is in the instance's buffer cache. In order to reduce the amount of writing to the database, which degrades performance, the writing of the modified block to the database might be deferred for some period of time. To protect against node failure, a “redo log” stored in the database maintains a history of modifications that the instance performs on blocks. A single redo record typically contains information that pertains to a modification to one or more blocks. A single redo record is typically one or two orders of magnitude smaller than a single block. Because redo records are much smaller in size relative to the corresponding blocks, writing a redo record to a redo log in persistent storage is much faster than writing a block to persistent storage.
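As a rough illustration of deferred block writes combined with redo logging, consider the sketch below. The `RedoRecord` fields, the sequence number, and the choice to append the redo record before changing the cached block are assumptions made for the example; the in-memory list merely stands in for a redo log kept in persistent storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

BLOCK_SIZE = 8 * 1024  # 8 KB, matching the example block size given above


@dataclass
class RedoRecord:
    seq: int          # hypothetical monotonically increasing sequence number
    block_id: int     # which block was modified
    offset: int       # where in the block the change begins
    new_bytes: bytes  # the bytes written at that offset


class InstanceSketch:
    """Hypothetical instance that modifies cached blocks and logs redo."""

    def __init__(self) -> None:
        self.buffer_cache: Dict[int, bytearray] = {}
        self.dirty: Set[int] = set()           # modified but not yet written
        self.redo_log: List[RedoRecord] = []   # stand-in for the persistent log
        self._next_seq = 0

    def modify(self, block_id: int, offset: int, new_bytes: bytes) -> RedoRecord:
        block = self.buffer_cache.setdefault(block_id, bytearray(BLOCK_SIZE))
        record = RedoRecord(self._next_seq, block_id, offset, new_bytes)
        self._next_seq += 1
        # Record the small redo record; the full block write is deferred.
        self.redo_log.append(record)
        # Apply the change to the copy of the block in the buffer cache only.
        block[offset:offset + len(new_bytes)] = new_bytes
        self.dirty.add(block_id)
        return record
```

Here a three-byte change produces a redo record carrying three bytes of payload, while the block it describes occupies 8 KB, which is why logging the change is cheaper than writing the block.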
After a period of time, one or more modified blocks are written to the database. The point at which one or more modified blocks are written to the database is known as a “checkpoint.” Any redo records in the redo log that precede a checkpoint may be ignored because the checkpoint indicates that the blocks stored in the database are “current” at the time of the checkpoint. In other words, the blocks stored in the database reflect the changes indicated in the redo records that were recorded in the redo log before the checkpoint.
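The relationship between a checkpoint and the redo log might be sketched as follows. The tuple layout of a redo record, the sequence numbers, and the function names are hypothetical, and the sketch assumes that every modified block is written at the checkpoint.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical redo record layout: (sequence number, block id, new contents).
RedoRecord = Tuple[int, int, bytes]


def checkpoint(buffer_cache: Dict[int, bytes], dirty: Set[int],
               database: Dict[int, bytes],
               redo_log: List[RedoRecord]) -> int:
    """Write every modified block to the database and return the sequence
    number of the last redo record covered by this checkpoint."""
    for block_id in sorted(dirty):
        database[block_id] = buffer_cache[block_id]
    dirty.clear()
    # -1 means that no redo record precedes the checkpoint.
    return redo_log[-1][0] if redo_log else -1


def records_needed_for_recovery(redo_log: List[RedoRecord],
                                checkpoint_seq: int) -> List[RedoRecord]:
    """Redo records at or before the checkpoint may be ignored."""
    return [record for record in redo_log if record[0] > checkpoint_seq]
```

After the checkpoint, only redo records logged later than the returned sequence number would need to be considered if the node were to fail.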
Sometimes nodes fail. When a node fails, the blocks stored in the buffer caches resident on that node may be lost. Some of those lost blocks might be blocks that were modified but not yet written to the database. In such a situation, a recovery process needs to be initiated so that the database contains the correct blocks. One recovery process is described in U.S. patent application Ser. No. 10/891,433, filed on Jul. 13, 2004, entitled, “Performance Metric-Based Selection of One or More Database Server Instances to Perform Database Recovery.”
Currently, a single surviving instance (referred to as the “recovery instance”) performs database recovery. The recovery instance must read the redo log of the crashed instance. If there is more than one crashed instance, then the recovery instance must read the redo log of each crashed instance and sort the redo records according to when the corresponding changes were committed. This process may take a considerable amount of time.
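The single-instance recovery described above can be pictured with the following sketch. It assumes, purely for illustration, that each redo record carries a hypothetical `change_number` reflecting commit order and that each crashed instance's redo log is already ordered by that number, so that the logs can be merged rather than fully re-sorted. Whole-block contents are used in place of real redo payloads to keep the example short.

```python
import heapq
from typing import Dict, List, NamedTuple


class RedoRecord(NamedTuple):
    change_number: int   # hypothetical global ordering of committed changes
    block_id: int
    new_contents: bytes


def merge_redo_logs(logs: List[List[RedoRecord]]) -> List[RedoRecord]:
    """Merge the per-instance redo logs into one sequence in commit order.

    Each input log is assumed to be sorted by change_number already.
    """
    return list(heapq.merge(*logs, key=lambda record: record.change_number))


def recover(database: Dict[int, bytes], logs: List[List[RedoRecord]]) -> None:
    """Replay every merged redo record against the database, one at a time."""
    for record in merge_redo_logs(logs):
        database[record.block_id] = record.new_contents
```

Because one instance reads every log and replays every record sequentially, the time taken grows with the total volume of redo across all crashed instances.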
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.