This invention relates to cluster computer systems in general. More particularly, the invention relates to recovering from cable failure in cluster computer systems with RAID devices.
Historically, data-center operators running critical applications demanding high reliability have turned to mainframes, minicomputers and the like running complex fault-tolerant software on complex fault-tolerant hardware. In a different market niche of less critical and less demanding environments, the Microsoft Corp. Windows operating system has made significant inroads into business data centers, running on relatively inexpensive and uncomplicated personal-computer and server platforms. These Windows platforms were adequate for certain services (database and e-mail, for example).
However, databases and e-mail are becoming increasingly important in the average business. Indeed, in some businesses these functions have taken on a critical nature. Accordingly, data-center operators with now-critical database, e-mail and similar applications want to run them on systems with high reliability. They are unwilling, however, to pay the costs of mainframes, minicomputers and their fault-tolerant software. In response to market demand, Microsoft Corp. has modified its Windows operating system to address the issue of high reliability.
Specifically, Microsoft now offers a Cluster Service product. Venerable if not ancient in the art, a "cluster" can be loosely defined as a parallel or distributed system of interconnected whole computers (sometimes called "systems" but herein termed "nodes" for clarity). The user of a cluster system logically views and uses it as a single, unified computing resource or service.
Generally speaking, a cluster enables the sharing of a computing load over several nodes without the user or client needing to know that more than one constituent node is involved. If any hardware or software component in the cluster system fails, the user or client may notice degraded performance but does not lose access to the service. The cluster system disperses the load from the failed component to the remainder of the cluster system. Conversely, if the user or client notices the need for more of a given resource (for example, processing power), that resource is simply added to the running cluster system, and the performance of the cluster system as a whole improves.
Well known in the art and only generally described here, the Microsoft Cluster Service product is the collection of all cluster-activity-management software on each node of a Microsoft cluster system. The Cluster Service is more fully described in "Microsoft Windows NT Server Cluster Strategy: High Availability and Scalability with Industry-Standard Hardware" (Microsoft Corp., 1995) and "Concepts and Planning: Microsoft 'Wolfpack' Clustering for Windows NT Server" (Microsoft Corp., 1996). These two Microsoft clustering documents are attached hereto as Appendices A and B and are incorporated by reference as well.
A Microsoft cluster system uses the Small Computer Systems Interface (SCSI) bus with multiple initiators as the storage connection (although Microsoft envisions supporting the Fiber Channel in the future). Well known in the art, SCSI is an evolving standard directed toward the support of logical addressing of data blocks on data devices. Documents detailing the variations of SCSI over time (SCSI-1, SCSI-2 and SCSI-3, for example) are available from the American National Standards Institute (ANSI) of New York, N.Y. (www.ansi.org). SCSI-1, SCSI-2 and SCSI-3 are together referred to as "SCSI" herein.
FIG. 1 illustrates a two-node cluster system 100 implemented on a SCSI bus 110 according to the prior art. In FIG. 1, the cluster system 100 includes a first server node 120a and a second server node 120b. The server nodes 120a and 120b have respective SCSI identifiers (SCSI IDs) 7 and 6. The server nodes 120 connect to the SCSI bus 110 through respective host bus adapters (HBAs) 121.
A node 120 typically includes one or more of the following: a central processor unit ("CPU") 126, a memory 122, a user interface 123, a co-processor 124, ports 125, a communications interface 121 and an internal bus 127.
Of course, in an embedded system, some of these components may be missing, as is well understood in the art of embedded systems. In a distributed computing environment, some of these components may be on separate physical machines, as is well understood in the art of distributed computing.
The memory 122 typically includes high-speed, volatile random-access memory (RAM) 1221, as well as non-volatile memory such as read-only memory (ROM) 1223. Further, the memory 122 typically contains software 1222. The software 1222 is layered: Application software 12221 communicates with the operating system 12222, and the operating system 12222 communicates with the I/O subsystem 12223. The I/O subsystem 12223 communicates with the user interface 123, the co-processor 124 and the communications interface 121 by means of the communications bus 127.
The communications interface 121, in this embodiment, is a host bus adapter 121.
The communications bus 127 communicatively interconnects the CPU 126, memory 122, user interface 123, co-processor 124 and communications interface 121.
To the SCSI bus 110 are also connected SCSI devices 130. The devices 130a through 130c can be, for example, physical disks with SCSI IDs 0 through 2, respectively.
Local disks 150 connect to respective nodes 120 as necessary.
FIG. 20 illustrates the physical view of a second cluster system 2000 implemented on a SCSI bus 110 with an external RAID controller 2060, according to the prior art. As in the cluster system 100, the cluster system 2000 includes the first and second server nodes 120. The server nodes 120 have respective SCSI IDs 7 and 6 and connect to the SCSI bus 110 through respective HBAs 121. Each of the nodes 120 runs software 1222.
To the SCSI bus 110 is also connected the device 130a and a RAID controller 2060 with respective unique SCSI IDs. Additional SCSI devices 2061 attach to the RAID controller 2060 by means of a SCSI bus 2062. The devices 130, 2061 can be physical disks, for example.
Again, local disks 150 connect to respective nodes 120 as necessary.
FIG. 21 illustrates the logical view of the cluster system 2000 of FIG. 20. The device 130 and the RAID controller 2060 each appears to the host 120 as a single SCSI device. The RAID controller 2060 organizes the devices 2061 to appear to the host 120 as logical units (LUNs) 2063 of the SCSI device 2060.
FIG. 22 illustrates the physical view of a third cluster system 2200 with internal RAID controllers 2210 and multiple shared SCSI channels 110, according to the prior art. As in the previous systems, the cluster system 2200 includes the first and second server nodes 120, with respective SCSI IDs 7 and 6. The server nodes 120 connect to multiple SCSI buses 110 through respective RAID controllers 2210 and run the software 1222.
To each SCSI bus 110 is connected at least one device 2061, each device 2061 having a SCSI ID unique for the channel 110 to which it connects. The devices 2061 can be physical disks, for example. Local disks 150 again connect to respective nodes 120 as necessary.
FIG. 23 illustrates the logical view of the cluster system 2200 of FIG. 22. The RAID controllers 2210 organize the devices 2061 to appear to the host 120 as SCSI disks 130 on a single SCSI channel 110. The RAID controllers 2210 thus appear to the host 120 as HBAs 121.
In this sense, the RAID controllers 2060, 2210 hide the complexity of the RAID SCSI disks 2061 and the controllers 2060, 2210 themselves from the hosts 120.
The SCSI standard implements Reserve( ) and Release( ) commands. This pair of commands allows a SCSI initiator (for example, a node 120) to reserve a SCSI target or logical unit on a SCSI target and later to release it. In the prior art, the usual handling of one of these commands in a Microsoft Windows 95/98 cluster system 100, 2000, 2200 involves an HBA 121 passing the command to the target, which then executes it.
Where the SCSI target of a Reserve( ) command is a logical unit 2063 of an external RAID controller 2060 or where the SCSI target is a logical disk 130 depending from an internal RAID controller 2210, the controller 2060, 2210 still passes the Reserve( ) command to all of the disks 2061 that compose the target. This pass-through method, however, is patently inefficient, reserving more devices 2061 than the initiator 120 may require. The pass-through method also imposes limitations on a RAID configuration.
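The over-reservation that the pass-through method causes can be illustrated with a minimal sketch. The class and method names below are hypothetical and do not come from any SCSI or Microsoft interface, and the sketch assumes, for illustration only, that two logical units share one physical disk. Under that assumption, a reservation of one logical unit by one initiator blocks a different initiator from reserving the other logical unit.

```python
# Illustrative model only: all names here are hypothetical, not part of
# the SCSI standard or of any controller firmware.

class ReservationConflict(Exception):
    """Raised when a disk is already reserved by a different initiator."""


class PassThroughController:
    def __init__(self, logical_units):
        # Maps a logical-unit name to the physical disk IDs that compose it.
        self.logical_units = logical_units
        # Maps a physical disk ID to the initiator holding its reservation.
        self.disk_owner = {}

    def reserve(self, initiator, lun):
        disks = self.logical_units[lun]
        # Check every composing disk first so a conflict reserves nothing.
        for disk in disks:
            owner = self.disk_owner.get(disk)
            if owner is not None and owner != initiator:
                raise ReservationConflict(
                    f"disk {disk} is reserved by initiator {owner}")
        # Pass the reservation through to every composing disk.
        for disk in disks:
            self.disk_owner[disk] = initiator

    def release(self, initiator, lun):
        # Release only the disks this initiator actually holds.
        for disk in self.logical_units[lun]:
            if self.disk_owner.get(disk) == initiator:
                del self.disk_owner[disk]


# Assumed layout: two logical units that share physical disk 1.
controller = PassThroughController({"lun0": [0, 1], "lun1": [1, 2]})
controller.reserve(7, "lun0")      # initiator with SCSI ID 7
try:
    controller.reserve(6, "lun1")  # initiator with SCSI ID 6
except ReservationConflict as err:
    print("conflict:", err)
```

In this sketch the second initiator is refused even though it asked for a different logical unit, which is one way the pass-through method can restrict how logical units are laid out over physical disks.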
The implementation of a RAID device 2060 in a cluster environment presents another problem, this with respect to disk failure. In a non-cluster environment, rebuilding a logical device 2063, 130 in the face of failure is a well-practiced art: A controller restores data from a mirroring physical drive to a replacement physical drive. In a non-cluster environment, the logical choice of which node 120 is to rebuild the failed logical device 2063, 130 is the one and only node 120 holding the reservation to any of the physical units 2061.
In a cluster environment, however, multiple nodes 120 can hold a reservation to a physical unit 2061 through reservations to logical devices 2063, 130 comprising that unit 2061. Further, one node 120 can reserve a logical device 2063, 130 while a different node 120 receives the command to rebuild the logical device 2063, 130.
Accordingly, it is desirable to handle more efficiently and less restrictively the SCSI Reserve( ) and Release( ) commands in a cluster environment with RAID devices.
Also, in a cluster environment with RAID devices, it is desirable to rebuild a logical unit in a manner simple and localized to the affected nodes.
These and other goals of the invention will be readily apparent to one of skill in the art on reading the background above and the description below.
Herein are described apparatus and methods for detecting failure of a node in a cluster computer system. The apparatus include controllers programmed to cooperate.
In one embodiment, the apparatus include first and second nodes with respective bus controllers communicatively coupled to each other and to a logical I/O device by means of a bus. The first node firstly recognizes the second node as a node in the cluster.
At some later point, a node whose communicative coupling has failed is coupled to the cluster a second time. The first node recognizes this second coupling and, in response, queries the second node for failure-status information.
The first and second nodes negotiate membership in the cluster for the second node on the first node's determining that the second node was the node that failed between the first and second couplings.
Various embodiments follow: The first node firstly recognizes the second node as either a master or a slave node in the cluster. Then the negotiating of the second node's cluster membership includes ceasing to recognize the second node as the master node, negotiating slave membership in the cluster for the second node and thirdly recognizing the first node as the master node, on determining on the first node that the second node failed between the first and second couplings and that the second node was firstly recognized as the master node.
Before the node is communicatively coupled to the cluster a second time, the node fails to keep itself communicatively coupled to the cluster.
The second communicative coupling includes resetting the bus. Recognizing the second coupling includes recognizing the resetting of the bus.
The first node's querying the second node for failure-status information comes in the context of the first node's querying each node of the cluster for failure-status information. The first node determines by means of the failure-status information whether the second node failed between the first and second couplings.
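The querying of each node after a bus reset can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name and the callback are hypothetical, and the failure-status information, whose format is not specified here, is assumed to reduce to a single true/false answer per node.

```python
# Illustrative sketch only. The form of the failure-status information is
# an assumption: each query is modeled as returning True when the queried
# node reports that it failed since the first coupling.

def nodes_failed_since_first_coupling(node_ids, query_failure_status):
    """Query every node in the cluster and return the IDs that report failure.

    node_ids: iterable of node identifiers (for example, SCSI IDs).
    query_failure_status: hypothetical callable taking a node ID and
        returning True if that node failed between the couplings.
    """
    failed = []
    for node_id in node_ids:
        if query_failure_status(node_id):
            failed.append(node_id)
    return failed


# Example: a bus reset has been recognized, and the node with SCSI ID 6
# reports that it lost its coupling to the cluster.
reported = {6: True}
failed = nodes_failed_since_first_coupling(
    [6], lambda node_id: reported.get(node_id, False))
print(failed)
```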
On determining that the second node failed between the first and second couplings and that the second node was secondly recognized as a slave node, the first node negotiates slave membership in the cluster for the second node.
On determining that the second node did not fail between the first and second couplings, the first node accepts the second node as a member of the cluster with its master/slave status intact.
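The three outcomes described above can be summarized in a short decision sketch. The names below are hypothetical, and the negotiation itself (the exchange of messages between the nodes) is not modeled; the sketch shows only which master/slave roles result in each case.

```python
# Illustrative sketch only: Role and negotiate_membership are hypothetical
# names, and the message exchange between the nodes is not modeled.
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


def negotiate_membership(first_role, second_role, second_failed):
    """Return (first_role, second_role) as held after the second coupling.

    first_role, second_role: roles recognized before the second coupling.
    second_failed: True if the second node failed between the first and
        second couplings, as determined from failure-status information.
    """
    if not second_failed:
        # No failure: the second node rejoins with its status intact.
        return first_role, second_role
    if second_role is Role.MASTER:
        # A failed master is demoted and the first node becomes master.
        return Role.MASTER, Role.SLAVE
    # A failed slave rejoins as a slave.
    return first_role, Role.SLAVE


# Example: the second node was the master and failed between couplings.
print(negotiate_membership(Role.SLAVE, Role.MASTER, second_failed=True))
```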