The present invention relates to server clustering technologies employed to increase availability of network server computers and, in particular, to a method and system for increasing the reliability of a quorum-disk component used in clustered server systems.
The present invention relates to server clustering technologies designed to provide server redundancy and server failover in order to increase the availability of data processing and data storage functionalities provided by the clustered servers to end users. The present invention may be implemented by enhancing various functionalities included in disk arrays. Therefore, a concise background of disk and disk-array technologies is provided below.
FIG. 1 is a block diagram of a standard disk drive. The disk drive 101 receives I/O requests from remote computers via a communications medium 102 such as a computer bus, fibre channel, or other such electronic communications medium. For many types of storage devices, including the disk drive 101 illustrated in FIG. 1, the vast majority of I/O requests are either READ or WRITE requests. A READ request requests that the storage device return to the requesting remote computer some requested amount of electronic data stored within the storage device. A WRITE request requests that the storage device store electronic data furnished by the remote computer within the storage device. Thus, as a result of a read operation carried out by the storage device, data is returned via communications medium 102 to a remote computer, and as a result of a write operation, data is received from a remote computer by the storage device via communications medium 102 and stored within the storage device.
The disk drive storage device illustrated in FIG. 1 includes controller hardware and logic 103 including electronic memory, one or more processors or processing circuits, and controller firmware, and also includes a number of disk platters 104 coated with a magnetic medium for storing electronic data. The disk drive contains many other components not shown in FIG. 1, including read/write heads, a high-speed electronic motor, a drive shaft, and other electronic, mechanical, and electromechanical components. The memory within the disk drive includes a request/reply buffer 105, which stores I/O requests received from remote computers, and an I/O queue 106 that stores internal I/O commands corresponding to the I/O requests stored within the request/reply buffer 105. Communication between remote computers and the disk drive, translation of I/O requests into internal I/O commands, and management of the I/O queue, among other things, are carried out by the disk drive I/O controller as specified by disk drive I/O controller firmware 107. Translation of internal I/O commands into electromechanical disk operations, in which data is stored onto, or retrieved from, the disk platters 104, is carried out by the disk drive I/O controller as specified by disk media read/write management firmware 108. Thus, the disk drive I/O control firmware 107 and the disk media read/write management firmware 108, along with the processors and memory that enable execution of the firmware, compose the disk drive controller.
Individual disk drives, such as the disk drive illustrated in FIG. 1, are normally connected to, and used by, a single remote computer, although it has been common to provide dual-ported disk drives for use by two remote computers and multi-port disk drives that can be accessed by numerous remote computers via a communications medium such as a fibre channel. However, the amount of electronic data that can be stored in a single disk drive is limited. In order to provide much larger-capacity electronic data-storage devices that can be efficiently accessed by numerous remote computers, disk manufacturers commonly combine many different individual disk drives, such as the disk drive illustrated in FIG. 1, into a disk array device, increasing both the storage capacity and the capacity for parallel I/O request servicing through concurrent operation of the multiple disk drives contained within the disk array.
FIG. 2 is a simple block diagram of a disk array. The disk array 202 includes a number of disk drive devices 203, 204, and 205. In FIG. 2, for simplicity of illustration, only three individual disk drives are shown within the disk array, but disk arrays may contain many tens or hundreds of individual disk drives. A disk array contains a disk array controller 206 and cache memory 207. Generally, data retrieved from disk drives in response to READ requests may be stored within the cache memory 207 so that subsequent requests for the same data can be more quickly satisfied by reading the data from the quickly accessible cache memory rather than from the much slower electromechanical disk drives. Various elaborate mechanisms are employed to maintain, within the cache memory 207, data that has the greatest chance of being subsequently re-requested within a reasonable amount of time. The disk array controller may also store data associated with WRITE requests in cache memory 207, in the event that the data may be subsequently requested via READ requests or in order to defer slower writing of the data to the physical storage medium.
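The read-caching behavior described above can be illustrated with a minimal sketch. It assumes a least-recently-used eviction policy, which is only one of the many mechanisms a real disk array controller might employ, and it writes through to the backing store rather than deferring writes. The class `ReadCache`, the `backing_store` dictionary, and the block numbering are hypothetical names introduced for illustration only.

```python
from collections import OrderedDict


class ReadCache:
    """Toy LRU cache in front of a slow backing store (hypothetical model)."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # dict: block number -> data
        self.capacity = capacity            # maximum number of cached blocks
        self.cache = OrderedDict()          # oldest entry first
        self.hits = 0
        self.misses = 0

    def _evict_if_full(self):
        # Drop the least recently used block once capacity is exceeded.
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def read(self, block):
        if block in self.cache:
            # Cache hit: mark the block as most recently used.
            self.cache.move_to_end(block)
            self.hits += 1
            return self.cache[block]
        # Cache miss: fetch from the slow store and remember the result.
        self.misses += 1
        data = self.backing_store[block]
        self.cache[block] = data
        self._evict_if_full()
        return data

    def write(self, block, data):
        # Assumption: write-through, so the store is updated immediately.
        self.backing_store[block] = data
        self.cache[block] = data
        self.cache.move_to_end(block)
        self._evict_if_full()
```

A repeated `read` of the same block is served from `cache` without touching `backing_store`; a deferred-write design would instead queue the update and flush it to the store later.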
Electronic data is stored within a disk array at specific addressable locations. Because a disk array may contain many different individual disk drives, the address space represented by a disk array is immense, generally many thousands of gigabytes. The overall address space is normally partitioned among a number of abstract data storage resources called logical units (“LUNs”). A LUN includes a defined amount of electronic data storage space, mapped to the data storage space of one or more disk drives within the disk array, and may be associated with various logical parameters including access privileges, backup frequencies, and mirror coordination with one or more LUNs. LUNs may also be based on random access memory (“RAM”), mass storage devices other than hard disks, or combinations of memory, hard disks, and/or other types of mass storage devices. Remote computers generally access data within a disk array through one of the many abstract LUNs 208-215 provided by the disk array via internal disk drives 203-205 and the disk array controller 206. Thus, a remote computer may specify a particular unit quantity of data, such as a byte, word, or block, using a bus communications media address corresponding to a disk array; a LUN specifier, normally a 64-bit integer, that specifies a LUN; and a 32-bit, 64-bit, or 128-bit data address within the logical data address partition allocated to the LUN. The disk array controller translates such a data specification into an indication of a particular disk drive within the disk array and a logical data address within the disk drive. A disk drive controller within the disk drive finally translates the logical address to a physical medium address.
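The first stage of the address translation described above, from a LUN-relative data address to a disk drive and a logical address within that drive, might be sketched as follows. The extent-based layout, the `Extent` type, the `LUN_MAP` table, and the small integer identifiers are all assumptions made for illustration; the text does not specify how a controller records the mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    disk_id: int  # which disk drive in the array holds this extent
    start: int    # first logical block of the extent on that drive
    length: int   # number of blocks in the extent


# Hypothetical table: LUN specifier -> ordered list of extents.
LUN_MAP = {
    0: [Extent(disk_id=0, start=0, length=100),
        Extent(disk_id=1, start=50, length=100)],
}


def translate(lun, lun_block):
    """Map (LUN, block address within the LUN) to (disk, block on disk)."""
    if lun_block < 0:
        raise ValueError("negative data address")
    offset = lun_block
    for extent in LUN_MAP[lun]:
        if offset < extent.length:
            return extent.disk_id, extent.start + offset
        # The address lies beyond this extent; move on to the next one.
        offset -= extent.length
    raise ValueError("data address outside the LUN")
```

In this sketch, LUN 0 spans 200 blocks across two drives; the final translation from a drive's logical address to a physical medium address is left to the drive controller and is not modeled.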
Normally, electronic data is read and written as one or more blocks of contiguous 32-bit or 64-bit computer words, the exact details of the granularity of access depending on the hardware and firmware capabilities within the disk array and individual disk drives as well as the operating system of the remote computers generating I/O requests and characteristics of the communication medium interconnecting the disk array with the remote computers.
In many computer applications and systems that need to reliably store and retrieve data from a mass storage device, such as a disk array, a primary data object, such as a file or database, is normally backed up to backup copies of the primary data object on physically discrete mass storage devices or media so that if, during operation of the application or system, the primary data object becomes corrupted or inaccessible, or is overwritten or deleted, the primary data object can be restored by copying a backup copy of the primary data object from the mass storage device. Many different techniques and methodologies for maintaining backup copies have been developed. In one well-known technique, a primary data object is mirrored. FIG. 3 illustrates object-level mirroring. In FIG. 3, a primary data object “O3” 301 is stored on LUN A 302. The mirror object, or backup copy, “O3” 303 is stored on LUN B 304. The arrows in FIG. 3, such as arrow 305, indicate I/O write operations directed to various objects stored on a LUN. I/O write operations directed to object “O3” are represented by arrow 306. When object-level mirroring is enabled, the disk array controller providing LUNs A and B automatically generates a second I/O write operation from each I/O write operation 306 directed to LUN A, and directs the second generated I/O write operation via path 307, switch “S1” 308, and path 309 to the mirror object “O3” 303 stored on LUN B 304. In FIG. 3, enablement of mirroring is logically represented by switch “S1” 308 being on. Thus, when object-level mirroring is enabled, any I/O write operation, or any other type of I/O operation that changes the representation of object “O3” 301 on LUN A, is automatically mirrored by the disk array controller to identically change the mirror object “O3” 303. Mirroring can be disabled, represented in FIG. 3 by switch “S1” 308 being in an off position. In that case, changes to the primary data object “O3” 301 are no longer automatically reflected in the mirror object “O3” 303. Thus, at the point that mirroring is disabled, the stored representation, or state, of the primary data object “O3” 301 may diverge from the stored representation, or state, of the mirror object “O3” 303. Once the primary and mirror copies of an object have diverged, the two copies can be brought back to identical representations, or states, by a resync operation represented in FIG. 3 by switch “S2” 310 being in an on position. In the normal mirroring operation, switch “S2” 310 is in the off position. During the resync operation, any I/O operations that occurred after mirroring was disabled are logically issued by the disk array controller to the mirror copy of the object via path 311, switch “S2” 310, and path 309. During resync, switch “S1” is in the off position. Once the resync operation is complete, logical switch “S2” is disabled and logical switch “S1” 308 can be turned on in order to reenable mirroring so that subsequent I/O write operations or other I/O operations that change the storage state of primary data object “O3” are automatically reflected to the mirror object “O3” 303.
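The mirroring, divergence, and resync behavior described with reference to FIG. 3 can be modeled in a few lines. This is a sketch under stated assumptions, not the disk array controller's actual logic: objects are represented as dictionaries, the `pending` list is a hypothetical log of writes made while mirroring is off, and `resync` both replays that log and re-enables mirroring in a single step.

```python
class MirroredObject:
    """Toy model of a primary object and its mirror (hypothetical names)."""

    def __init__(self):
        self.primary = {}              # object state on LUN A
        self.mirror = {}               # object state on LUN B
        self.mirroring_enabled = True  # corresponds to switch S1 being on
        self.pending = []              # writes issued while S1 is off

    def write(self, key, value):
        self.primary[key] = value
        if self.mirroring_enabled:
            # S1 on: the write is duplicated to the mirror immediately.
            self.mirror[key] = value
        else:
            # S1 off: remember the write so a later resync can replay it.
            self.pending.append((key, value))

    def disable_mirroring(self):
        self.mirroring_enabled = False

    def resync(self):
        # S2 on: replay, in order, every write made since mirroring stopped.
        for key, value in self.pending:
            self.mirror[key] = value
        self.pending.clear()
        # S2 off and S1 back on: normal mirroring resumes.
        self.mirroring_enabled = True
```

Between `disable_mirroring()` and `resync()` the two dictionaries may differ, which corresponds to the diverged state described above; after `resync()` they are identical again.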
Storage subsystems, such as disk arrays, are normally incorporated into larger systems of server computers and storage subsystems that together provide data processing and data storage resources to a large number of clients interconnected to server computers and storage subsystems via many different types of communications technologies. Originally, a server computer was considered to be a relatively independent processing resource that manages some number of data storage resources closely interconnected with the server. However, as the need for more robust and more reliable data processing and data storage resources has evolved in parallel with the evolution of computer hardware and software applications, various methodologies for combining individual servers and data storage subsystems together to form larger, distributed data processing and data storage resources have become common. One such collective data processing and data storage resource is known as a “server cluster.”
A server cluster comprises a number of interconnected and cooperating servers that together provide data processing resources and data storage resources to a large number of client computers. FIG. 4 illustrates the logical organization of a two-server server cluster. The server cluster 400 includes a first server computer 402 and a second server computer 404 interconnected via a communications medium, such as a fibre channel or specialized (e.g. SCSI) bus 406. The server cluster illustrated in FIG. 4 is an example of a modern shared-nothing cluster in which each server controls and manages, at each point in time, its own data storage resources. For example, in the server cluster shown in FIG. 4, when both servers 402 and 404 are operating normally, server 402 manages and controls storage devices 408-410 and server 404 manages and controls storage devices 412-414. Generally, data is stored redundantly, at least to some extent, within the cluster so that, if a storage device, or a portion of one, fails, the server cluster can fail over to a backup storage device containing redundant data. If a server within the server cluster fails, then those applications that were running on the failed server can be restarted on another, still-operating server within the cluster, and clients previously linked to applications on the failed server can relink to the restarted applications on a still-operating server within the server cluster. When a server fails within the cluster, the disk and other data storage resources controlled and managed by the failed server are acquired, for control and management purposes, by still-operating servers within the cluster.
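The acquisition of a failed server's storage resources by the surviving servers can be sketched as a simple reassignment of an ownership table. The function `fail_over`, the round-robin distribution among survivors, and the use of the reference numerals of FIG. 4 as dictionary keys are all illustrative assumptions; the text does not say how devices are divided when more than one server survives.

```python
def fail_over(ownership, failed, survivors):
    """Return a new ownership table with the failed server's devices reassigned.

    ownership: dict mapping storage device -> managing server
    failed:    the server that has stopped operating
    survivors: list of still-operating servers
    """
    if not survivors:
        raise ValueError("no surviving server can acquire the devices")
    new_ownership = dict(ownership)  # leave the caller's table untouched
    orphaned = sorted(dev for dev, srv in ownership.items() if srv == failed)
    for i, dev in enumerate(orphaned):
        # Assumption: orphaned devices are handed out round-robin.
        new_ownership[dev] = survivors[i % len(survivors)]
    return new_ownership


# Labels borrowed from FIG. 4 purely for readability.
before = {408: 402, 409: 402, 410: 402, 412: 404, 413: 404, 414: 404}
after = fail_over(before, failed=402, survivors=[404])
```

In the two-server case, the single survivor ends up managing all six storage devices, which matches the shared-nothing behavior described above.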
A well-known problem characteristic of server clusters is that, when communications between servers within the cluster fail, an operating-but-no-longer-intercommunicating server may consider itself to be the only surviving server within the cluster, and attempt to acquire the storage devices of other servers and to restart applications running on other servers of the cluster. Various scenarios involving failures of communications between servers of a server cluster are known as “split-brain” scenarios, a characterization no doubt borrowed from physiological and neurological terminology describing pathological conditions that result from disruption of nervous-system pathways interconnecting the two hemispheres of a brain.
One technique used to avoid split-brain scenarios involves establishing a quorum disk (“QD”) within a server cluster, for which each server in the cluster attempts to gain control during cluster startup procedures and in response to detected cluster component failures. In the server cluster illustrated in FIG. 4, disk 416 is designated the QD. During cluster startup, both servers 402 and 404 contend for control of the QD 416 by issuing RESERVE commands to the QD 416. Generally, the server whose RESERVE command is first received by the QD 416 becomes the owner of the QD and, as a result, becomes the dominant server within the cluster. During cluster formation, coordinated by the dominant server, cluster configuration information may be stored on the QD, and following cluster formation, cluster-wide events may be logged to the QD. While the dominant server of a cluster maintains ownership of the QD, no other server may issue WRITE requests to the disk.
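The startup arbitration described above, in which the first RESERVE command received wins ownership and only the owner may write, can be modeled as follows. The `QuorumDisk` class and its method names are hypothetical stand-ins for the commands named in the text and do not represent an actual device interface.

```python
class QuorumDisk:
    """Toy quorum disk that grants ownership to the first reserver."""

    def __init__(self):
        self.owner = None  # server currently holding the reservation
        self.log = []      # cluster configuration and event records

    def reserve(self, server):
        # RESERVE succeeds if the disk is free or already held by this server.
        if self.owner is None or self.owner == server:
            self.owner = server
            return True
        return False

    def write(self, server, record):
        # Only the owner (the dominant server) may issue WRITE requests.
        if self.owner != server:
            raise PermissionError(f"{server} does not own the quorum disk")
        self.log.append(record)


qd = QuorumDisk()
first = qd.reserve("server_402")   # arrives first and becomes dominant
second = qd.reserve("server_404")  # arrives second and is refused
qd.write("server_402", "cluster formed")
```

Because at most one reservation can be held at a time, at most one server can consider itself dominant, which is the property that prevents a split-brain scenario.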
During normal cluster operations, servers exchange heartbeat messages in order to monitor the operational states of the servers and other components within the cluster. When a server fails to receive heartbeat signals from the dominant server of the cluster, due either to failure of the dominant server or to failure of communications interconnections linking the server to the dominant server, the server may transition to a challenging server in order to attempt to gain control of the cluster. The challenging server issues a RESERVE command to the QD and, if the RESERVE command succeeds, reforms the cluster as the new dominant server. If the RESERVE command fails, the challenging server may issue a RESET command and wait, for some period of time, to detect whether or not another server issues a subsequent RESERVE command in response. If no subsequent RESERVE command is issued by another server to the QD, then the challenging server may issue a RESERVE command directed to the QD, gain ownership of the QD, and reform the cluster. However, if the dominant server of the cluster issues a RESERVE command to the QD in response to the challenging server's RESET command, then the dominant server maintains control over the cluster. By using a quorum-disk-contention protocol in each server of the cluster, split-brain scenarios can be avoided.
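One round of the challenge protocol described above can be traced with a small, sequential model. It is a sketch only: the waiting period is collapsed into a single step, the `dominant_alive` flag stands in for whether the dominant server is still able to respond to the RESET, and the names `QuorumDisk` and `challenge` are hypothetical.

```python
class QuorumDisk:
    """Toy quorum disk holding at most one reservation (hypothetical model)."""

    def __init__(self, owner=None):
        self.owner = owner

    def reserve(self, server):
        # RESERVE succeeds if the disk is free or already held by this server.
        if self.owner is None or self.owner == server:
            self.owner = server
            return True
        return False

    def reset(self):
        # RESET clears any existing reservation.
        self.owner = None


def challenge(qd, challenger, dominant, dominant_alive):
    """Run one challenge round and return the resulting owner of the QD."""
    if qd.reserve(challenger):
        return challenger  # first RESERVE succeeded: challenger takes over
    qd.reset()             # break the existing reservation
    # Waiting period: a live dominant server answers the RESET by
    # issuing a new RESERVE of its own.
    if dominant_alive:
        qd.reserve(dominant)
    # After waiting, the challenger tries again; this fails if the
    # dominant server has re-reserved the disk in the meantime.
    qd.reserve(challenger)
    return qd.owner
```

With `dominant_alive=True` the dominant server retains the disk, and with `dominant_alive=False` the challenger acquires it, so in either case exactly one server owns the QD at the end of the round.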
The logical organization of a two-server cluster, illustrated in FIG. 4, can be achieved via numerous different physical hardware implementations. FIG. 5 illustrates a small computer systems interface (“SCSI”) implementation. In the SCSI implementation, servers 402 and 404 are linked together and are linked to the various storage resources of the cluster 408-410, 412-414, and 416, via a SCSI bus 502. FIG. 6 illustrates a fibre-channel implementation of the two-server cluster illustrated in FIG. 4. In FIG. 6, servers 402 and 404 are linked through fibre channel switches 602 and 604 to the data storage resources 408-410, 412-414, and 416. The fibre-channel implementation provides a more robust interconnection between components of the server cluster that, in turn, substantially increases the availability of data processing and data storage resources provided by the cluster system.
Although server cluster technology is a vast improvement over independent, non-cooperating servers, commonly available clustered server systems cannot guarantee non-stop provision of data processing and data-storage resources. If, for example, cluster administrators and applications that run on the server cluster do not deliberately and carefully maintain redundant data, the failure of a data-storage device may be unrecoverable. Certain failures within interconnecting communications media, or failures of multiple components, such as multiple servers, may also be at least temporarily unrecoverable. Perhaps the greatest vulnerability of the clustered server systems described with reference to FIGS. 4-6 is that the QD 416 represents a potential single point of failure. If the QD fails, or the communications medium connecting the QD to cluster servers fails, then the server cluster also fails. A new cluster can be started by choosing another disk as the QD, but in commonly available cluster technologies, restarting the cluster using a new QD requires manual intervention by human administrators. However, the intent of cluster technologies is to provide reliable operation and automated failover following automated detection of component failure. Therefore, manufacturers, designers, and users of server cluster technologies have recognized the need for removing the single point of failure represented by the QD in currently available server clusters.
One embodiment of the present invention enhances the reliability and robustness of a server cluster by removing the single point of failure represented by the QD used to arbitrate management of a server cluster. In currently available server cluster technologies, a single disk drive or logical unit is designated as the QD. A server within the server cluster acquires ownership of the QD and is thereby promoted to become the dominant server of the server cluster. It is the dominant server's responsibility to manage cluster initialization and oversee cluster maintenance functions. Failure of the single QD causes currently available server clusters to fail, requiring a fully or partially manual cluster restart procedure. In one embodiment of the present invention, the functionality of a QD is distributed between two physical mirrored disk drives or mirrored LUNs. The failure of either one of the two mirrored disk drives or mirrored LUNs can be automatically compensated for by the server cluster, without interruption of the provision of data processing and data storage services to cluster clients and users. The two mirrored disk drives or LUNs that together compose a distributed QD can be located in geographically dispersed locations to create a geographically distributed QD in order to protect a server cluster from large-scale, but geographically localized, threats, such as earthquakes, floods, and fires. A distributed or geographically dispersed QD that represents another embodiment of the present invention comprises a dominant LUN associated with a uniform request queue and a subordinate LUN that mirrors the dominant LUN. Because the subordinate LUN is a mirror copy of the dominant LUN, failure of the dominant LUN can be automatically compensated for within the server cluster by immediately failing over to the subordinate LUN. Failure of the subordinate LUN is compensated for by forwarding all requests or operations directed to the subordinate LUN to the dominant LUN.
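The failure handling of a distributed QD summarized above can be illustrated with a highly simplified model. Everything in it is an assumption made for illustration: the `Lun` and `DistributedQuorumDisk` classes, the single in-memory request queue, and the `failed` flags are hypothetical, and the recovery of a failed LUN (which would require a resync) is not modeled.

```python
from collections import deque


class Lun:
    """Toy LUN that stores an ordered list of records."""

    def __init__(self, name):
        self.name = name
        self.records = []
        self.failed = False


class DistributedQuorumDisk:
    """Two mirrored LUNs presented as one quorum disk (hypothetical model)."""

    def __init__(self):
        self.dominant = Lun("dominant")
        self.subordinate = Lun("subordinate")
        self.queue = deque()  # single queue that orders all requests

    def _live_luns(self):
        return [lun for lun in (self.dominant, self.subordinate)
                if not lun.failed]

    def submit(self, record):
        # Every request passes through one queue so that both LUNs
        # apply the same operations in the same order.
        self.queue.append(record)
        while self.queue:
            live = self._live_luns()
            if not live:
                raise RuntimeError("both LUNs of the quorum disk have failed")
            next_record = self.queue.popleft()
            for lun in live:
                lun.records.append(next_record)

    def read_all(self):
        # Prefer the dominant LUN; fail over to the subordinate if needed.
        live = self._live_luns()
        if not live:
            raise RuntimeError("both LUNs of the quorum disk have failed")
        return list(live[0].records)
```

As long as one of the two LUNs remains operational, `submit` and `read_all` continue to work, which is the sense in which the distributed QD no longer represents a single point of failure.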