1. The Field of the Invention
The invention relates to redundant arrays of independent disks (RAID) in client/server computing environments, and more specifically to systems and methods for reliable failover capabilities involving write operations of a failed node within a cluster.
2. The Relevant Art
In contemporary client/server computing environments, a cluster is a set of coupled, independent computer systems called nodes, which behave as a single system. A client interacts with a cluster as though it were a single server. The combined computing power and storage of a cluster make the cluster a valuable tool in many applications ranging from on-line business to scientific modeling. In many instances, the reliability of these systems is critical to the overall success of a business endeavor or scientific experiment.
The most vulnerable components of a computer system, including a cluster system, are the hard disk drives, which contain essentially the only mechanical, moving parts in the otherwise electronic assembly. Data written to a single drive is only as reliable as that drive, and many drives eventually fail. The data stored on these hard disk drives in many cases represents critical client information, investment information, academic information, or the like. In an age when information storage and access are becoming increasingly important to all enterprises, more reliable methods of data storage are needed.
One existing storage method is a redundant array of independent disks (RAID). RAID systems store and access data on multiple individual hard disk drives as if the array were a single, larger disk. Distributing data over these multiple disks reduces the risk of losing the data if one drive fails, and it also improves access time. RAID was developed for use in transaction or application servers and large file servers. Currently, RAID is also utilized in desktop or workstation systems where high transfer rates are needed.
In a cluster environment, such as the one described above, RAID and similar shared disk arrays are implemented to provide a client with access to the computing power of the combined nodes together with the large storage capacity of the disk array. FIG. 1 shows a schematic representation of a cluster system 100 of the prior art. Shown therein are a node cluster system 102, a network hub 104, a cluster administrator 106, and a plurality of clients 108. The depicted node cluster system 102 is shown by way of example as a two-node system comprising two nodes 110, which are typically computer systems or servers. Node cluster systems 102 may comprise any number of nodes 110, the quantity of which is determined by the storage and computing capacity required.
Depicted within each node 110 is a RAID controller 112, which will be discussed in greater detail below with respect to FIG. 2. Through the RAID controllers 112, the nodes 110 transfer data to a RAID array 114. The RAID controllers 112 communicate with the RAID array 114 through data channels 116. In the depicted embodiment, the data channels 116 connecting the RAID controllers 112 and the RAID array 114 are preferably small computer system interface (SCSI) channels.
The cluster system 102 connects to a Local Area Network (LAN) 120 or a private network cable or interconnect 118. Under the depicted embodiment, the cluster system 102, cluster administrator 106, and the plurality of clients 108 are connected by the network hub 104. The cluster administrator 106 preferably monitors and manages cluster operations. Occasionally, a RAID controller 112 in one of the nodes 110 fails, generally due to a component or power failure. When this occurs, non-cached write operations may be underway and incomplete. As a consequence, critical data may be lost.
Referring now to FIG. 2, current RAID controllers 112 generally consist of a microprocessor 202, a SCSI controller 204, a flash read-only memory (ROM) module 206, a dynamic random access memory (DRAM) module 208, and a non-volatile random access memory (NVRAM) module 210. Within the NVRAM module 210 resides a mirror race table (MRT) 214. The MRT 214 maintains the beginning logical block address of each group of data that is undergoing a write operation on the RAID disks. The group of data may be striped across the disks or organized as smaller groups of data known as cache line groups.
FIG. 3a illustrates one embodiment of an MRT 214 of the prior art. Shown therein are a valid flag bit 302, a logical block address 304, and a logical drive number 306. The MRT 214 maintains a list of incomplete write operations. When a write operation is completed, the write operation's MRT entry is cleared by the RAID controller 112. When a failure occurs, the RAID controller 112 performs a consistency check upon returning to functionality.
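By way of illustration only, the bookkeeping performed with such an MRT may be sketched as follows. The sketch is written in Python for readability and is not taken from any controller firmware; the class names, the method names, and the fixed table capacity are hypothetical. Only the three fields (valid flag bit 302, logical block address 304, logical drive number 306) and the rule that an entry is cleared when its write completes come from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MrtEntry:
    """One MRT row; field names are illustrative, not from the source."""
    valid: bool = False             # valid flag bit 302
    logical_block_address: int = 0  # logical block address 304
    logical_drive: int = 0          # logical drive number 306


class MirrorRaceTable:
    """Hypothetical in-memory stand-in for the MRT held in NVRAM."""

    def __init__(self, capacity: int = 8) -> None:
        # Assumption: the table has a fixed number of slots.
        self.entries: List[MrtEntry] = [MrtEntry() for _ in range(capacity)]

    def begin_write(self, logical_drive: int, lba: int) -> int:
        """Record the starting LBA of a write before it is issued."""
        for slot, entry in enumerate(self.entries):
            if not entry.valid:
                self.entries[slot] = MrtEntry(True, lba, logical_drive)
                return slot
        raise RuntimeError("MRT full")

    def complete_write(self, slot: int) -> None:
        """Clear the entry once the write has finished on all disks."""
        self.entries[slot] = MrtEntry()

    def incomplete_writes(self) -> List[Tuple[int, int]]:
        """List (logical drive, LBA) pairs still marked as in flight."""
        return [(e.logical_drive, e.logical_block_address)
                for e in self.entries if e.valid]
```

In this sketch, a controller returning to functionality would need to check only the regions returned by `incomplete_writes()` rather than the entire logical drive.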
Occasionally, a RAID controller 112 may fail. In such a case, the other functioning RAID controllers have no access to the MRT 214 of the failed controller 112. The remaining RAID controllers 112 therefore cannot identify the incomplete write operations of the failed controller 112 or make them consistent. However, the remaining RAID controllers 112 can identify the logical drives of the RAID array 114 pertaining to the failed controller 112. A remaining controller 112 will initiate a background consistency check (BGCC) on each of those logical drives, from beginning to end, and, where data inconsistency due to an incomplete write is found, a consistency restoration.
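A full BGCC pass of the kind described above may be sketched as follows, again in Python and purely as an illustration. The sketch assumes a parity-protected logical drive in which each stripe holds several data blocks followed by one XOR parity block, and it assumes that restoration means recomputing the parity from the data blocks; neither assumption is stated in the description above, and the function names are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


def background_consistency_check(stripes: List[List[bytes]]) -> List[int]:
    """Scan every stripe from the beginning to the end of the drive.

    Each stripe is [data block, ..., parity block]. Where the stored
    parity disagrees with the data, the parity is rewritten (assumed
    restoration policy). Returns the indices of repaired stripes.
    """
    repaired = []
    for index, stripe in enumerate(stripes):
        *data, parity = stripe
        expected = xor_blocks(data)
        if parity != expected:
            stripe[-1] = expected  # consistency restoration
            repaired.append(index)
    return repaired
```

Because the loop visits every stripe regardless of whether a write was in flight on it, the running time grows with the size of the logical drive rather than with the number of incomplete writes.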
With the logical drive sizes currently in use, a BGCC of said logical drives of the RAID array 114 may take several hours. During this period, read and write operations are allowed to occur in the foreground on those logical drives that are involved in the BGCC but have not yet been completely checked. Data corruption may occur if a physical drive of one of said logical drives of the RAID array 114 fails a read request directed to a cache line group that has not yet been made consistent. The corruption results from a RAID controller 112 regenerating the data from the other physical drives of that logical drive without recognizing that the data was inconsistent to begin with. This problem is commonly known as a “write hole.”
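The write hole may be made concrete with a small worked example, offered as a sketch under stated assumptions rather than as a description of any particular controller. It assumes a stripe of two data blocks and one XOR parity block; the byte values and the function name are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# Consistent stripe before the interrupted write (assumed values).
d0_old = b"\x11\x11"
d1 = b"\x22\x22"
parity = xor_bytes(d0_old, d1)

# The controller writes new data to the first drive and then fails
# before the matching parity update reaches the parity drive.
d0_new = b"\x55\x55"

# Later, the drive holding d1 fails a read. A surviving controller
# regenerates d1 from the remaining data block and the stale parity.
regenerated_d1 = xor_bytes(d0_new, parity)

# Had the parity been updated, regeneration would have been correct.
parity_updated = xor_bytes(d0_new, d1)
correct_d1 = xor_bytes(d0_new, parity_updated)
```

Here `regenerated_d1` differs from `d1`, yet nothing signals an error to the controller: the regenerated block is returned to the host as if it were valid data.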
Thus, it can be seen from the above discussion that a need exists in the art for an improved reliable failover method and apparatus for resolving incomplete RAID disk writes after a disk failure.