Multi-processing computer systems typically fall into three categories: shared everything systems, shared disk systems, and shared-nothing systems. In shared everything systems, processes on all processors have direct access to all volatile memory devices (hereinafter generally referred to as “memory”) and to all non-volatile memory devices (hereinafter generally referred to as “disks”) in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, there are scalability limits to shared everything architectures.
In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within the particular node. Shared disk systems generally require less wiring than shared everything systems. Shared disk systems also adapt easily to unbalanced workload conditions because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, then various steps may have to be taken to ensure that the correct version of the data is provided to the second node.
In shared-nothing systems, all processors, memories and disks are grouped into nodes. In shared-nothing systems, as in shared disk systems, each node may itself constitute a shared everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within the particular node. Of the three general types of multi-processing systems, shared-nothing systems typically require the least amount of wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the task, even though processes on other nodes remain idle.
Databases that run on multi-node systems typically fall into two categories: shared disk databases and shared-nothing databases.
Shared Disk Databases
A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.
Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the many nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, the node must coordinate with the other nodes to have the particular version of the data item shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of “data shipping,” where data must be shipped to the node that has been assigned to work on the data.
Such data shipping requests may result in “pings”. Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk, and then read from disk. Performance of the disk operations necessitated by pings can significantly reduce the performance of the database system.
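The cost of a ping can be illustrated with a minimal sketch. The names below (`SharedDisk`, `Node`, `fetch`) are hypothetical and do not correspond to any particular database product. The model assumes the simplest ping handling described above, in which the holding node writes a modified data item to disk and the requesting node then reads it back from disk.

```python
class SharedDisk:
    """Hypothetical disk visible to every node; counts I/O operations."""
    def __init__(self):
        self.blocks = {}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.blocks.get(key)

    def write(self, key, value):
        self.writes += 1
        self.blocks[key] = value


class Node:
    """Hypothetical node with a private cache in front of the shared disk."""
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.cache = {}     # key -> cached value
        self.dirty = set()  # keys modified in cache but not yet on disk

    def update(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self, key):
        # Write the cached version to disk only if it was modified.
        if key in self.dirty:
            self.disk.write(key, self.cache[key])
            self.dirty.discard(key)


def fetch(requester, key, all_nodes):
    """Ship a data item to the requester; return (value, pinged)."""
    pinged = False
    for other in all_nodes:
        if other is not requester and key in other.cache:
            pinged = True      # the item resides in another node's cache
            other.flush(key)   # disk write forced by the ping
    value = requester.disk.read(key)  # disk read forced by the ping
    requester.cache[key] = value
    return value, pinged


disk = SharedDisk()
node_a = Node("A", disk)
node_b = Node("B", disk)
node_a.update("row1", "v2")  # node A holds the only current version
value, pinged = fetch(node_b, "row1", [node_a, node_b])
```

In this sketch a single request from node B costs one disk write and one disk read, which is the overhead the preceding paragraph attributes to pings.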
Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have access to remote disks.
Shared-Nothing Databases
A shared-nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping the data between nodes, shared-nothing databases are said to perform “function shipping”.
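Function shipping can be sketched as follows. The classes `OwnerNode` and `Cluster` and the method `ship_function` are hypothetical names introduced only for illustration. The sketch assumes that each data item has exactly one owner and that a request is routed to that owner, which performs the operation locally.

```python
class OwnerNode:
    """Hypothetical node that owns, and alone may touch, its data items."""
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)

    def execute(self, key, operation):
        # The operation runs on the owning node; the data never leaves it.
        self.data[key] = operation(self.data[key])
        return self.data[key]


class Cluster:
    """Hypothetical routing layer mapping each data item to its owner."""
    def __init__(self, nodes):
        self.owner_of = {key: node for node in nodes for key in node.data}

    def ship_function(self, key, operation):
        # Send the request to the owner instead of shipping the data.
        return self.owner_of[key].execute(key, operation)


node1 = OwnerNode("node1", {"acct_a": 100})
node2 = OwnerNode("node2", {"acct_b": 50})
cluster = Cluster([node1, node2])

# A request for acct_b is performed by node2, its owner.
result = cluster.ship_function("acct_b", lambda balance: balance + 25)
```

Only the request and its result cross node boundaries here; the data item itself stays with its owner.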
Because any given piece of data is owned by only one node, only the one node (the “owner” of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared disk database systems. Further, shared-nothing systems do not suffer the performance penalties associated with pings, since a node that owns a data item will not be asked to save a cached version of the data item to disk so that another node could then load the data item into its cache.
Shared-nothing databases may be run on both shared disk and shared-nothing multi-processing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database, and assigning ownership of each partition to a particular node.
The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by one of the nodes. Consequently, the one node is overworked and the computational resources of the other nodes are underutilized. To “rebalance” the workload, a shared-nothing database may be taken offline, and the data (and ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload skew.
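The ten-node example above can be worked through numerically. The function name `requests_per_node` and the figure of 9,000 total requests are assumptions chosen only to keep the arithmetic in whole numbers. The sketch also assumes that the remaining 10% of requests is spread evenly over the other nine nodes.

```python
def requests_per_node(total_requests, num_nodes, hot_percent):
    """Return per-node request counts when one node receives hot_percent."""
    hot = total_requests * hot_percent // 100
    remaining = total_requests - hot
    cold_nodes = num_nodes - 1
    base, extra = divmod(remaining, cold_nodes)
    # Spread the remainder evenly over the other nodes.
    cold = [base + (1 if i < extra else 0) for i in range(cold_nodes)]
    return [hot] + cold


loads = requests_per_node(9000, 10, 90)
fair_share = 9000 // 10                   # load per node if perfectly balanced
overload_factor = loads[0] / fair_share   # how overworked the hot node is
idle_fraction = 1 - loads[1] / fair_share # unused capacity on each other node
```

Under these assumptions the hot node handles 8,100 requests, nine times its balanced share of 900, while each of the other nine nodes handles only 100.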
Failures in a Database System
A database server failure can occur when a problem arises that prevents a database server from continuing work. Database server failures may result from hardware problems such as a power outage, or software problems such as an operating system or database system crash. Database server failures can also occur expectedly, for example, when a SHUTDOWN ABORT or a STARTUP FORCE statement is issued to an Oracle database server.
Due to the way in which database updates are performed to data files in some database systems, at any given point in time, a data file may contain some data blocks that (1) have been tentatively modified by uncommitted transactions and/or (2) do not yet reflect updates performed by committed transactions. Thus, a database recovery operation must be performed after a database server failure to restore the database to the transaction consistent state it possessed just prior to the database server failure. In a transaction consistent state, a database reflects all the changes made by transactions which are committed and none of the changes made by transactions which are not committed.
A typical database system performs several steps during a database server recovery. First, the database system “rolls forward”, or reapplies to the data files all of the changes recorded in the redo log. Rolling forward proceeds through as many redo log files as necessary to bring the database forward in time to reflect all of the changes made prior to the time of the crash. Rolling forward usually includes applying the changes in online redo log files, and may also include applying changes recorded in archived redo log files (online redo log files that are archived before being reused). After rolling forward, the data blocks contain all committed changes, as well as any uncommitted changes that were recorded in the redo log prior to the crash.
Rollback segments include records for undoing uncommitted changes that remain after the roll-forward operation. In database recovery, the information contained in the rollback segments is used to undo the changes made by transactions that were uncommitted at the time of the crash. The process of undoing changes made by the uncommitted transactions is referred to as “rolling back” the transactions.
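The two recovery phases can be sketched in a few lines. The record layouts and the function name `recover` are simplifying assumptions and do not reflect the on-disk format of any actual database system. The redo log is modeled as a list of update and commit records, and the rollback segments as a list of (transaction, key, old value) undo records.

```python
def recover(data_files, redo_log, undo_records):
    """Return a transaction consistent copy of data_files."""
    db = dict(data_files)
    committed = set()

    # Phase 1: roll forward. Reapply every change recorded in the redo
    # log, whether or not the transaction that made it committed.
    for record in redo_log:
        if record[0] == "update":
            _, txn, key, new_value = record
            db[key] = new_value
        elif record[0] == "commit":
            committed.add(record[1])

    # Phase 2: roll back. Undo, newest first, the changes made by
    # transactions that had not committed at the time of the crash.
    for txn, key, old_value in reversed(undo_records):
        if txn not in committed:
            db[key] = old_value

    return db


# State on disk at the crash: T1's committed update to x has not been
# written yet, while T2's uncommitted update to y has been written.
data_files = {"x": 1, "y": 20}
redo_log = [
    ("update", "T1", "x", 2),
    ("commit", "T1"),
    ("update", "T2", "y", 20),
]
undo_records = [("T1", "x", 1), ("T2", "y", 10)]

recovered = recover(data_files, redo_log, undo_records)
```

After recovery the database reflects the committed change of T1 (x is 2) and none of the changes of the uncommitted T2 (y is restored to 10).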
The techniques described herein are not limited to environments in which rollback segments are used for undoing transactions. For example, in some database environments, the undo and redo are written in a single sequential log. In such environments, recovery may be performed based on the contents of the single log, rather than distinct redo and undo logs.
Failure in a Shared-Nothing Database System
In any multiple-node computer system, it is possible for one or more nodes to fail while one or more other nodes remain functional. In a shared-nothing database system, failure of a node typically makes the data items owned by the failed node unavailable. Before those data items can be accessed again, a recovery operation must be performed on those data items. The faster the recovery operation is performed, the more quickly the data items will become available.
In a shared-nothing database system, recovery operations may be performed using either no partitioning or pre-failure partitioning. When no partitioning is used, a single non-failed node assumes ownership of all data items previously owned by the failed node. The non-failed node then proceeds to perform the entire recovery operation itself. Because the no partitioning approach only makes use of the processing power of one active node, the recovery takes much longer than it would if the recovery operation were shared across many active nodes. Recovery is nevertheless typically done this way in shared-nothing databases, because the recovering node needs to have access to the data of the failed node. For simplicity of the hardware configuration, a “buddy” system is typically used, where the nodes are divided into pairs in which each node has access to its partner's data and is responsible for recovering its partner in the event of a failure.
According to the pre-failure partitioning approach, the data owned by the failed node is partitioned into distinct shared-nothing database fragments prior to the failure. After failure, each of the distinct fragments is assigned to a different non-failed node for recovery. Because the recovery operation is spread among many nodes, the recovery can be completed faster than if performed by only one node. However, it is rarely known exactly when a node will fail. Thus, for a node to be recovered using the pre-failure partitioning approach, the partitioning, which typically involves dividing the main memory and CPUs of the node among the database fragments, is typically performed long before any failure actually occurs. Unfortunately, while the node is thus partitioned, the steady-state runtime performance of the node is reduced. Various factors lead to such a performance reduction. For example, each physical node's resources may be underutilized. Although multiple partitions are owned by the same physical node, the partitions cannot share memory for the buffer pool, package cache, etc. This causes underutilization because it is possible to make better use of a single piece of memory than of fragmented pieces of memory. In addition, the interprocess communication for a given workload increases with the number of partitions. For example, an application that scales to four partitions may not scale to twelve partitions. However, using the pre-failure partitioning approach for parallel recovery after failure, twelve partitions may be required.
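The difference between the two recovery approaches can be sketched as an assignment problem. The function `assign_fragments` and the node and fragment names are hypothetical. Round-robin assignment is an assumption made for illustration; the description above only requires that the fragments be spread among different non-failed nodes.

```python
def assign_fragments(fragments, surviving_nodes):
    """Map each surviving node to the fragments it must recover."""
    assignment = {node: [] for node in surviving_nodes}
    for i, fragment in enumerate(fragments):
        # Round-robin: consecutive fragments go to different nodes.
        node = surviving_nodes[i % len(surviving_nodes)]
        assignment[node].append(fragment)
    return assignment


survivors = ["node2", "node3", "node4"]

# Pre-failure partitioning: the failed node's data already exists as
# three fragments, so three surviving nodes can recover in parallel.
parallel_plan = assign_fragments(["frag_a", "frag_b", "frag_c"], survivors)

# No partitioning: the failed node's data is a single unit, so a single
# surviving node performs the entire recovery.
serial_plan = assign_fragments(["all_data"], survivors)
```

With three fragments, each surviving node recovers one fragment; with no partitioning, `node2` recovers everything while the other survivors do no recovery work.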