This invention pertains generally to the field of computer memory systems, and more particularly to a method and apparatus for controlling redundant arrays of independent disks.
Modern computers frequently require large, fault-tolerant memory systems. One approach to meeting this need is to provide a Redundant Array of Independent Disk drives (RAID), usually including a plurality of hard disk drives operated by a disk array controller that is coupled to a host computer. The controller provides the brains of the memory system, servicing all host requests, storing data to or retrieving it from the RAID, caching data to provide faster access, and handling drive failures without interrupting host requests. Given the importance of the controller, numerous solutions have been suggested to minimize the potential for interrupted service due to controller malfunction. One such solution calls for providing dual-active controllers having failover and failback capabilities. Dual-active controllers are a pair of controllers that are connected to each other and to all the disk drives in a RAID. In normal operation, input/output (I/O) requests from the host computer are divided between the dual-active controllers to increase the rate at which information can be transferred to or from the RAID, commonly referred to as the bandwidth of the memory system. However, in the event that one of the controllers fails, the surviving controller takes over the functions of the failed controller and begins servicing host requests addressed to the failed controller in addition to those addressed to it. The mechanism that allows this is commonly known as a failover mechanism. If the surviving controller is able to assume the functions of the failed controller without any action on the part of the host computer, for example redirecting I/O requests to the surviving controller, the failover mechanism is said to be transparent. If the failed controller can be subsequently replaced and normal operation resumed without de-energizing or reinitializing the controllers, the memory system is said to have a failback mechanism.
One example of the use of such dual-active controllers is described in U.S. Pat. No. 5,790,775, to Marks et al., which uses dual-active controllers connected to the host computer by a Small Computer System Interface (SCSI) bus. Typically, the controllers are also connected to a RAID comprising multiple disk drives through a number of additional SCSI buses. Each SCSI device on a bus, such as a controller or a disk drive, is assigned one bit as an identifier (SCSI ID) to permit the host computer to select a particular controller, and the controller to select a particular disk drive. Thus, the method permits a maximum of eight devices to be identified on a standard 8-bit SCSI bus. In addition, the controllers are connected to one another by a separate communications link, and each has access to a cache memory in the other. Although both controllers are connected to every disk drive in the RAID, to permit dual-active operation each disk drive is typically under primary control of one of the controllers. This is accomplished by dividing the RAID into groups of disk drives, each of which appears to the host computer as a logical drive or unit identified by a logical unit number (LUN), and, during initialization, associating each LUN with the SCSI ID of a particular controller. In normal operation, a controller responds only to I/O requests which are addressed to it and which refer to LUNs over which it has primary control. However, if a controller fails, the remaining controller of the pair obtains configuration information, including the SCSI ID and the LUNs of the failed controller, over the communications link and begins servicing requests addressed by the host to the failed controller as well as those addressed to itself.
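The SCSI addressing and takeover behavior described above can be illustrated with a minimal Python sketch. It is offered only as an illustration and is not taken from the Marks et al. patent; the names `id_bit`, `Controller`, `accepts`, and `take_over` are hypothetical, and the communications link and cache sharing are not modeled.

```python
from dataclasses import dataclass, field


def id_bit(scsi_id: int, bus_width: int = 8) -> int:
    """Return the single data-bus bit that represents a SCSI ID.

    Because each device owns one bit, a bus with `bus_width` data bits
    can identify at most `bus_width` devices.
    """
    if not 0 <= scsi_id < bus_width:
        raise ValueError("SCSI ID does not fit on this bus width")
    return 1 << scsi_id


@dataclass
class Controller:
    scsi_id: int
    luns: set                                    # LUNs under primary control
    adopted: dict = field(default_factory=dict)  # failed SCSI ID -> its LUNs

    def accepts(self, target_id: int, lun: int) -> bool:
        """True if this controller should service the request."""
        if target_id == self.scsi_id:
            return lun in self.luns
        return lun in self.adopted.get(target_id, set())

    def take_over(self, failed: "Controller") -> None:
        """Adopt the SCSI ID and LUNs of a failed partner (hypothetical)."""
        self.adopted[failed.scsi_id] = set(failed.luns)


a = Controller(scsi_id=7, luns={0, 1})
b = Controller(scsi_id=6, luns={2, 3})
a.take_over(b)  # after this, a also answers requests sent to ID 6
```

In this sketch a request is matched on both the target SCSI ID and the LUN, so the surviving controller services the failed controller's LUNs only when they are addressed to the failed controller's ID.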
While the above approach has been effective in reducing interruptions in service for memory systems having dual-active controllers, it is limited by the architecture of the SCSI bus. Traditionally, SCSI buses have from eight to sixteen signal lines, which allows a maximum of from eight to sixteen SCSI devices to be interconnected by the SCSI bus at any one time. Thus, systems which use a 16-bit wide SCSI bus on the host side and 8-bit wide SCSI buses on the device side typically provide for at most six device side SCSI buses having six disk drives each. Moreover, the above approach, which relies on SCSI IDs, has not been implemented using fibre interface type controllers.
Fibre interface type controllers are coupled to a host computer through one or more fibre channels. Fibre channel is the general name of a technology using an integrated set of standards developed by the American National Standards Institute (ANSI) for high speed, serial communication between computer devices. (See for example the ANSI standard X3T11, “Fibre Channel Physical and Signaling Interface (FC-PH),” Rev 4.3 (1994), hereby incorporated by reference.) Manufacturers of RAID systems have been moving to fibre channel technology because it allows data to be transmitted between computer devices at rates of over 1 Gbps (one billion bits per second), and over distances of several hundred meters or more. Also, fibre channel arbitrated loop (FC-AL) allows for 127 unique loop identifiers, one of which is reserved for a fabric loop port.
The widely accepted approach to providing failover/failback capability in RAID systems comprising fibre interface controllers has been to use dual-active controllers coupled by a redirecting driver. In the event of a controller failure the redirecting driver shifts host requests from the failed controller to a surviving controller. The failed controller can then be replaced and the memory system reinitialized to return to normal, dual-active controller operation. The redirecting driver can be implemented using a software or hardware protocol. One exemplary redirecting driver is disclosed in U.S. Pat. No. 5,237,658, to Walker et al., hereby incorporated by reference. However, one problem associated with this type of solution is that it is achieved at the expense of added memory system complexity that increases cost and decreases bandwidth. In addition, when, as is common, the redirecting driver is implemented using software in the host computer, this approach is not independent of the host computer, and typically requires a special driver for each host computer system on which it is to be utilized. This further adds to the cost and complexity, and increases the difficulty of installing and maintaining the memory system.
Accordingly, there is a need for a memory system comprising a number of fibre interface controllers and having a failover mechanism that is transparent to a host computer. There is a further need for such a memory system having a failback mechanism that is also transparent to the host computer. The present invention provides a solution to these and other problems, and offers additional advantages over the prior art.
The present invention provides a memory system and method of operating a memory system. In one embodiment, the memory system includes a number of controllers connected by a fibre channel arbitrated loop to provide transparent failover and failback for failed controllers. The controllers are adapted to transfer data between a data storage system and at least one host computer in response to instructions therefrom. In the inventive method, a unique identifier is provided to each controller to permit the host computer to address instructions to a specific controller. Then, operation of the controllers is monitored and, when a failed controller is detected, a failover procedure is performed on a surviving controller. In one embodiment, the failover procedure disables the failed controller and causes the surviving controller to assume the identity of the failed controller. Thus, the surviving controller becomes capable of responding to instructions addressed to it and instructions addressed to the failed controller, and the failure of the failed controller is transparent to the host computer. In one particular embodiment, the step of providing a unique identifier to each controller preferably includes the step of providing a world wide name to each controller, and more preferably the step further includes providing a loop identifier to each controller.
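The steps of the method summarized above (providing identifiers, monitoring, and failing over) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the names `Identity`, `Controller`, `answers_to`, and `monitor_and_failover` are hypothetical, failure is modeled by a simple `alive` flag, and loop initialization is not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    wwn: str      # world wide name (placeholder string, not a real WWN format)
    loop_id: int  # loop identifier on the arbitrated loop


@dataclass
class Controller:
    identity: Identity
    alive: bool = True     # assumed to be set False by failure detection
    enabled: bool = True   # cleared when the partner disables this controller
    assumed: list = field(default_factory=list)  # identities taken over

    def answers_to(self, loop_id: int) -> bool:
        """True if the host can reach this controller at `loop_id`."""
        if not (self.alive and self.enabled):
            return False
        own = loop_id == self.identity.loop_id
        return own or any(i.loop_id == loop_id for i in self.assumed)


def monitor_and_failover(a: Controller, b: Controller):
    """Run one monitoring pass; return the survivor if a failover occurred."""
    for survivor, partner in ((a, b), (b, a)):
        if survivor.alive and not partner.alive and partner.enabled:
            partner.enabled = False                    # disable the failed controller
            survivor.assumed.append(partner.identity)  # assume its identity
            return survivor
    return None


c0 = Controller(Identity("wwn-0", 1))
c1 = Controller(Identity("wwn-1", 2))
c1.alive = False               # simulate a controller failure
monitor_and_failover(c0, c1)   # c0 now answers to loop identifiers 1 and 2
```

Because the host keeps addressing the same loop identifier before and after the failure, no host-side redirection is needed in this model, which is the sense in which the failover is transparent.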
In another aspect the invention provides a memory system for transferring data between a data storage system and at least one host computer in response to instructions therefrom. The memory system includes a pair of dual-active controllers connected by a fibre channel arbitrated loop. Each controller has a unique identifier and is adapted to assume the identity of a failed controller and to respond to instructions addressed to it, thereby rendering failure of the failed controller transparent to the host computer. In one embodiment, the memory system further includes a communication path coupling the controllers, the communication path being adapted to enable each controller to detect failure of the other controller. The present invention is particularly useful for data storage systems comprising multiple disk drives coupled to the controllers by disk channels, in which at least one disk channel also serves as the communication path.
In yet another aspect the invention provides a computer program and a computer program product for operating a memory system comprising a plurality of controllers, each controller having a unique identifier, and the controllers adapted to transfer data between a data storage system and at least one host computer in response to instructions therefrom. The computer program product includes a computer readable medium with a computer program stored therein. The computer program has a failure detection unit adapted to detect a failed controller. A failover unit is adapted to enable a surviving controller to respond to instructions addressed to it and to instructions addressed to the failed controller. The failover unit includes a disabling unit adapted to disable the failed controller. The failover unit also includes a loop initialization unit, which is adapted to instruct a surviving controller to assume the identity of the failed controller and to respond to instructions addressed to the failed controller as well as instructions addressed to the surviving controller. Thus, failure of the failed controller is transparent to the host computer. In one embodiment, each controller has an active port and a failover port, and the failover unit is adapted to activate the failover port of the surviving controller. In another embodiment, the computer program product further includes a replacement detection unit adapted to instruct a replacement controller to assume the identity of the failed controller and respond to instructions addressed to the failed controller, thereby rendering replacement of the failed controller transparent to the host computer.
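The active port, failover port, and replacement handling described above can be illustrated with a short Python sketch. It is a hedged illustration only: the names `Port`, `Controller`, `failover`, and `failback` are hypothetical, and the assumption that the surviving controller deactivates its failover port when the replacement takes over the identity is made for the example and is not stated in the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    identity: Optional[str] = None  # identity presented on this port
    active: bool = False


class Controller:
    def __init__(self, identity: str):
        self.active_port = Port(identity, True)  # normal-operation port
        self.failover_port = Port()              # idle until a failover

    def identities(self) -> set:
        """Identities this controller currently answers to."""
        ports = (self.active_port, self.failover_port)
        return {p.identity for p in ports if p.active}


def failover(survivor: Controller, failed: Controller) -> None:
    """Disable the failed controller and activate the survivor's failover port."""
    failed.active_port.active = False
    survivor.failover_port = Port(failed.active_port.identity, True)


def failback(survivor: Controller, replacement: Controller) -> None:
    """Hand the assumed identity to a replacement controller (assumed behavior)."""
    taken = survivor.failover_port.identity
    survivor.failover_port = Port()              # assumption: port goes idle again
    replacement.active_port = Port(taken, True)  # replacement assumes the identity


a, b = Controller("id-A"), Controller("id-B")
failover(a, b)                    # a answers for both id-A and id-B
spare = Controller("factory-default")
failback(a, spare)                # spare answers for id-B; a returns to id-A only
```

Throughout both transitions the identity "id-B" remains reachable by exactly one controller, which is what keeps the failure and the replacement invisible to the host in this model.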
In still another aspect the invention provides a memory system for transferring data between a data storage system and at least one host computer in response to instructions therefrom. The memory system comprises a pair of dual-active controllers connected by a fibre channel arbitrated loop, each controller having a unique identifier, and a means for providing a failover mode from a failed controller to a surviving controller that is substantially transparent to the host computer. In one embodiment, the means for providing a failover mode is a computer program product having a computer program including a loop initialization unit adapted to instruct the surviving controller to assume the identity of the failed controller and to instruct the surviving controller to respond to instructions addressed to it and to the failed controller.