The present invention relates to a technology to guarantee high reliability in operation of a plurality of controllers for input/output (I/O) devices in a computer system. In particular, it relates to a method of redundantly arranging controllers in an external storage subsystem adopting a Small Computer System Interface (SCSI), in which the controllers are arranged at least in a duplicated configuration and can be accessed from the host systems, such that when failure occurs in one of the controllers a process can be transferred between the controllers without intervention of the user or the host systems.
In a system configuration employing the SCSI in which a plurality of controllers and a storage shared between at least two controllers are connected by an interface cable in a daisy chain to the host systems, the plural controllers respectively have different port addresses such as SCSI IDs. Ordinarily, each controller processes the I/O requests that the host systems address to its port address.
JP-A-4-364514 describes a system in which the controllers are arranged in a multiplex configuration such that I/O requests from a host apparatus to storages connected to the plural controllers are processed at a high speed. In such a conventional system, when failure occurs in one of the controllers and the host system changes the controller it specifies to execute the I/O request, the I/O request can be processed by a normal controller. However, in a system in which the host system and the plural controllers are connected to each other in a daisy chain, no consideration has been given to a procedure in which, when failure occurs in a controller, the process is transferred to a normal controller for execution without intervention of the host system.
After issuing an I/O request to a controller, the host system ordinarily monitors termination of the I/O request by a timer in the host system. If the I/O request has not terminated when the monitor time predetermined by the host system elapses after the issuance of the I/O request, the host system provisionally treats the state as an error. After conducting processes such as a bus recovery process for an SCSI bus, the host system re-issues the same I/O request, specifying the port address of the failed controller.
When the controller does not respond to the re-issued I/O request, the host system regards the state as a permanent error and hence does not thereafter issue any I/O request to the failed controller. Upon failure of a controller in the conventional system, when the host system recognizes the permanent error, the data process thereof is interrupted. Therefore, even when a plurality of controllers are disposed, user intervention is required to continue the data process of the host system when failure occurs in the pertinent controller.
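The conventional host-side behavior described above can be sketched as follows. This is only an illustrative model, not code from the source: the names `issue_with_monitor`, `send`, and `bus_recovery` are hypothetical, and `send` is assumed to return `True` when the I/O request terminates within the monitor time.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    PERMANENT_ERROR = "permanent_error"


def issue_with_monitor(send, bus_recovery, port_address, request, monitor_time):
    """Model of the host's timer-monitored I/O with a single re-issue.

    send(port_address, request, monitor_time) is a hypothetical callable that
    returns True if the I/O request terminated within monitor_time.
    bus_recovery() is a hypothetical callable standing in for bus recovery.
    """
    if send(port_address, request, monitor_time):
        return Outcome.OK

    # Monitor time elapsed: the host provisionally treats this as an error,
    # recovers the bus, and re-issues the same request to the same port address.
    bus_recovery()
    if send(port_address, request, monitor_time):
        return Outcome.OK

    # No response to the re-issued request: the host regards the error as
    # permanent and stops issuing I/O requests to this port address.
    return Outcome.PERMANENT_ERROR
```

In this model, the re-issued request succeeds only if some controller answers at the original port address before the second monitor time elapses, which is the window the invention relies on.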
Furthermore, when there are disposed a plurality of host systems, and when a controller fails and enters a hang-up situation with the bus occupied by the failed controller, another data process being executed between another host system and another controller is also interrupted. User intervention is also required to recover the interrupted data process.
It is therefore an object of the present invention to provide a failure recovery method and system in which when a failure occurs in a controller, the process thereof is transferred to a normal controller to continuously perform the data process without any intervention by the host system or user.
Additionally, when the failed controller has not yet received the I/O request from the host system and hence the error has not been assumed, it is necessary to suppress I/O requests to the failed controller as far as possible to prevent an abnormal operation. Consequently, in accordance with the present invention, the transfer of the port address and control information is executed after the host systems have been prevented from issuing I/O requests to the failed controller.
To achieve the object above according to the present invention, a normal controller has a function to receive control information of the failed controller and a function to reference the port address of the failed controller to add the contents thereof to its own port address. Furthermore, the normal controller possesses a function to reset the port address in the failed controller to thereby erase the port address.
Due to these functions, the normal controller can receive the port address and control information of the failed controller and accept and execute the I/O request issued to the failed controller. Alternatively, a method may be employed in which the port address is reset by the failed controller itself.
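The port address takeover described above can be illustrated with a small model. This is a sketch under stated assumptions rather than the disclosed implementation: the `Controller` class, its fields, and the `take_over` function are hypothetical names, port addresses are modeled as a set of integers, and control information as a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical model of a controller with one or more port addresses."""
    name: str
    port_addresses: set = field(default_factory=set)
    control_info: dict = field(default_factory=dict)
    inherited_control_info: dict = field(default_factory=dict)
    failed: bool = False

    def accepts(self, port_address):
        # A controller responds only to requests sent to an address it holds.
        return not self.failed and port_address in self.port_addresses


def take_over(normal, failed):
    """Move the failed controller's port address and control info to normal."""
    # Receive the control information of the failed controller.
    normal.inherited_control_info = dict(failed.control_info)
    # Add the failed controller's port address to the normal controller's own.
    normal.port_addresses |= failed.port_addresses
    # Reset (erase) the port address in the failed controller.
    failed.port_addresses.clear()
```

After `take_over`, a request that the host re-issues to the failed controller's port address is accepted by the normal controller, and the host does not need to change the address it specifies.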
Moreover, according to the present invention, there is provided a function by which the normal controller monitors a bus such as an SCSI bus upon detection of the failure to decide whether or not the failed controller has already received the I/O request from the host system. When the failed controller has already received the I/O request from the host system, the transfer of the port address and control information of the failed controller is completed before the host system recognizes the permanent error, so that the process of the host system continues without any intervention by the user or host system.
In addition, when the normal controller is executing an I/O process upon detection of a failure in a controller, it is assumed that the failed controller has not yet received the I/O request from the host system. According to the present invention, there is provided a function to detect this condition such that the transfer of the port address and control information of the failed controller is accomplished during the I/O process execution of the normal controller.
As a result, I/O requests from the host system to the failed controller can be suppressed until the port address transfer process is completed.
In addition, when a bus such as an SCSI bus is not being used by any controller upon detection of the failure, it is considered that the failed controller has not yet received the I/O request from the host system. According to the present invention, there is provided a function in which this condition is detected and the normal controller selects the failed controller, so that the transfer of the port address and control information is executed after the selection is accomplished. Due to this function, I/O requests from the host system to the failed controller can be suppressed until the port address transfer process is completed.
Owing to the adoption of a construction of this type, in a situation in which a failed controller has received an I/O request and the execution of the I/O process has not terminated, with a bus such as an SCSI bus kept exclusively reserved by the failed controller, a normal controller detects the state, completes reception of the port address and control information, and resets the failed controller within the I/O monitor time of the host system. Any subsequent I/O requests to the failed controller can then be received and executed by the normal controller. As a result, the system can respond to the I/O request re-issued from the host system, and hence both the interruption of the process of the host system and the inhibition of issuance of I/O requests from the host system can be prevented.
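The three bus conditions distinguished above, and the action taken in each, can be summarized in a short sketch. The enum members, step names, and the `plan_takeover` function are hypothetical labels chosen for illustration; they only restate the cases described in the text and are not taken from an actual controller implementation.

```python
from enum import Enum


class BusState(Enum):
    RESERVED_BY_FAILED = "reserved_by_failed"  # failed controller already received an I/O request
    IN_USE_BY_NORMAL = "in_use_by_normal"      # normal controller is executing an I/O process
    FREE = "free"                              # no controller is using the bus


def plan_takeover(bus_state):
    """Return the ordered steps the normal controller takes for a bus state."""
    if bus_state is BusState.RESERVED_BY_FAILED:
        # Must finish within the host's I/O monitor time so that the
        # re-issued request is answered by the normal controller.
        return ["transfer_port_address_and_control_info",
                "reset_failed_controller"]
    if bus_state is BusState.IN_USE_BY_NORMAL:
        # The bus is occupied by the normal controller's own I/O, so the
        # transfer is carried out during that I/O process.
        return ["transfer_port_address_and_control_info"]
    if bus_state is BusState.FREE:
        # Selecting the failed controller occupies the bus first, which
        # suppresses host I/O requests until the transfer is complete.
        return ["select_failed_controller",
                "transfer_port_address_and_control_info"]
    raise ValueError(f"unknown bus state: {bus_state!r}")
```

In the last two cases the bus is occupied for the duration of the transfer, so the host cannot issue an I/O request to the failed controller until the normal controller already holds its port address.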
Moreover, upon detection of a failure in a controller, the normal controller can suppress I/O requests from the host system to the failed controller. Therefore, when the failed controller has not yet received the I/O request, the host system need not recognize the error and any subsequent I/O requests can be received by the normal controller, thereby implementing the nonstop system operation.