In a clustered computing system, such as the clustered computing system 101 illustrated in FIG. 1 having two (or more) host computer systems 104, 106 coupled via respective SCSI controllers 108, 110 and a Small Computer System Interface (SCSI) bus 112 to a common or shared disk drive box 114, it is advantageous to be able to perform maintenance on one of the nodes 104, 106 without taking down the entire cluster, and without subjecting the cluster to the possibility of failure as a result of removing or powering down a node (for example, node 2) while the other node (for example, node 1) is still operating on the bus 112. This powered-on addition of a component (node) to the SCSI bus, or removal of a component (node) from it, is referred to as "hot-plugging" or "hot-unplugging", respectively. In this example, the two hosts or nodes are connected to a drive box 114 containing one or a plurality of hard disk drives 118 (118a, 118b, 118c), such as for example a RAID disk array within a SAF-TE enclosure; however, nodes 104, 106 may be connected to SCSI bus supported devices other than disk drives or drive arrays. We note that hot-plugging, hot-unplugging, and hot-swapping may refer to connecting components to the SCSI bus on either the drive side (for example, plugging in the drive box 114 itself, or the individual disk drives 118) or on the node side (for example, the host system nodes 104, 106). Here, we are concerned primarily with hot-plugging one of the host systems on the node side of the bus, where in conventional systems the node side is the side nearest the host system nodes, and the drive side is the side nearest the disk drive box 114.
The physical structure and communications protocol of the SCSI bus, particularly relative to the first or SCSI-1 protocol (but also including SCSI-2, SCSI-3, SPI-2, and other present and anticipated future variants of SCSI), were developed substantially independently of any recognized need for hot-plugging components to the SCSI bus or removing SCSI components from the bus. It is known that proper operation of an SCSI bus and the components operating on the bus depends upon proper termination of the SCSI bus as described in the SCSI Standards specification. When proper termination is not provided, operation may be erratic at best, and attempts to operate a SCSI based cluster without proper termination at the nodes will frequently crash the entire clustered system.
In the exemplary clustered system of FIG. 1, it is noted that each of the nodes 104, 106 includes a controller 108, 110 and that each of the controllers includes electronic circuitry or other means to implement an appropriate SCSI termination 120, 122 for the end of the SCSI bus as it interfaces with a respective node 104, 106. While the particular manner in which SCSI bus termination is provided is not important, and for example the termination may be provided without the use of an explicit controller, it is important that the termination be maintained for proper operation. With this in mind, it is clear that removal of a node (e.g. node 104), or its controller 108, would also remove the SCSI termination 120 and leave the entire cluster 101 susceptible to failure.
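The dependency between node removal and bus termination described above may be illustrated by a minimal Python sketch. It is a toy model only, under stated assumptions: the class names `Node` and `ScsiBus` and the method `is_properly_terminated` are hypothetical and do not correspond to any real SCSI interface, and the model assumes, as in FIG. 1, that each end of the bus is terminated only by the node attached at that end.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical host node whose controller carries the bus terminator."""
    name: str
    has_terminator: bool = True


@dataclass
class ScsiBus:
    """Toy bus model: each end is terminated only by the node attached there."""
    ends: dict = field(default_factory=dict)  # end label -> Node or None

    def attach(self, end: str, node: Node) -> None:
        self.ends[end] = node

    def remove(self, end: str) -> None:
        # Removing the node also removes the terminator it carried.
        self.ends[end] = None

    def is_properly_terminated(self) -> bool:
        # Assumption: the bus has exactly two ends and both must be terminated.
        return len(self.ends) == 2 and all(
            n is not None and n.has_terminator for n in self.ends.values()
        )


bus = ScsiBus()
bus.attach("end_a", Node("node 104"))
bus.attach("end_b", Node("node 106"))
before = bus.is_properly_terminated()

bus.remove("end_a")  # hot-unplug node 104 together with its termination 120
after = bus.is_properly_terminated()
print(before, after)
```

In this sketch the bus reports proper termination only while both nodes are attached; once node 104 is removed, the end it occupied is left unterminated, which corresponds to the failure exposure described above.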
It is sometimes necessary to remove nodes, either because of an actual failure or problem with a component of the node, because normal maintenance is due, to upgrade the node or components within the node, or for other reasons. Frequently, the equipment at each node (such as the host computer systems 104, 106) is provided in the form of a Field-Replaceable Unit (FRU) to permit relatively rapid removal and/or replacement with minimum or zero down time. These FRUs are intended to be hot-plugged, hot-unplugged, or hot-swapped. Note that the term "hot-swapped" typically is used to denote the removal of one device and its replacement by another, while "hot-plugged" may in some contexts refer to the addition of a device that had not been present just prior to that addition, or to the removal of a device without the connection of a replacement. We use the terms interchangeably here, and provide additional clarification if the distinction is relevant to the particular description or use of the term.
While device hot-plugging is currently used in many conventional computer systems, such conventional use is not without problems. When an FRU is hot-plugged to an SCSI bus, there is a strong likelihood that the overall cluster system 101 will crash (that is, for example, the system may hang, fail, or go into some other non-operational state), that one or more of the remaining nodes will experience problems or hang, or that other problems will occur, particularly problems associated with communicating data from the drive box 114 by the remaining nodes.
Therefore it is desirable to provide some degree of isolation of such nodes, whether in the form of a Field-Replaceable Unit (FRU) or not, from the rest of the system in the event that the node or FRU is removed from or added to the cluster 101, to avoid the undesirable effects of hot-plugging onto an active system or the undesirability of downing the cluster or remaining nodes for maintenance. In the remainder of this description, we will refer to nodes, such as host systems 104, 106 or the controllers 108, 110 within, as FRUs, with the understanding that the important factor is the location of the SCSI terminator 120, 122 within the node or FRU, independent of whether the node would actually be classified as a Field-Replaceable Unit.
One conventional approach to providing some limited degree of isolation in instances where an FRU 104, 106 is added to or removed from the cluster 101 has been the provision of an SCSI repeater 132, 134 between the FRU (for example, node 104) and the rest of the clustered system (for example, node 106 and drive box 114), such as is illustrated in FIG. 2. In this example, a SCSI repeater 132, 134 is provided between each of the FRUs 104, 106 and the shared drive box 114.
Unfortunately, even with this SCSI repeater 132, 134 approach, the normal operation of conventional SCSI repeaters means that the FRU SCSI bus segment 136 is not isolated from the system SCSI bus segment 138 on the other side of SCSI repeater 132, or from the remainder of the cluster system 101. Any disturbance of the FRU SCSI bus segment 136 resulting from removal of FRU 104 or the absence of the FRU 104 from the bus will be transmitted to the system SCSI bus segment side 138 of the SCSI bus, since the function of the SCSI repeater 132 is to repeat any signal from one side (e.g. the FRU side 136) to the other side (e.g. the system side 138). While for actual signals it is desirable to reshape signal waveforms for retransmission, it is not desirable to reshape and retransmit noise or other disturbance.
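The limitation of the conventional repeater may likewise be sketched in Python. This is an illustrative model only, not a description of any actual repeater circuit: the names `BusEvent` and `conventional_repeater` and the flag `is_disturbance` are hypothetical, and the model assumes simply that the repeater forwards whatever appears on the FRU side without distinguishing valid signals from disturbance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusEvent:
    """Hypothetical bus event; is_disturbance marks glitches from hot-plugging."""
    description: str
    is_disturbance: bool = False


def conventional_repeater(events):
    """Toy repeater: forwards every event, since it cannot tell signal from noise."""
    return list(events)


# Events appearing on the FRU SCSI bus segment 136.
fru_side = [
    BusEvent("valid SCSI transfer"),
    BusEvent("glitch from FRU removal", is_disturbance=True),
]

# Everything is repeated onto the system SCSI bus segment 138.
system_side = conventional_repeater(fru_side)
disturbances_seen = [e.description for e in system_side if e.is_disturbance]
print(disturbances_seen)
```

In this model the disturbance caused by removing the FRU reaches the system side unchanged, which is the behavior that leaves the remainder of the cluster exposed.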
In spite of these limitations of the SCSI repeater 132, 134 used in this manner and its associated problems, the use of SCSI repeaters during node disconnection has continued, but its applicability is typically limited to systems where the SCSI bus is in "quiet down" mode, that is, to situations where activity on the SCSI bus has been stopped so that the system will not hang. Unfortunately, since the system activity on the SCSI bus must be stopped, the system is not usable during that time.
More recently, two-node clustering, such as the Microsoft™ Wolfpack™, has become popular in the fault-tolerant server environment. These two-node clusters use the SCSI bus as the back end I/O interface to achieve some measure of cluster system fault tolerance. While the SCSI bus is not the ideal interface for device or FRU hot-plugging or unplugging, in maintaining fault tolerance some hot-plugging for system maintenance is highly desirable even though it introduces other problems, and some degree of protection is better than none at all.
Therefore there remains a need for a structure and method for replacing the SCSI bus termination at a system node when the system node or components thereof are added or removed, such as when hot-plugging or hot-unplugging an FRU, and for isolating the FRU SCSI bus segment from the remainder of the cluster both during the modification itself and while the FRU is removed from the cluster, so that the system continues to operate while that node is down.