This application claims priority from Japanese Patent Application Reference No. P00-032873, filed Feb. 10, 2000.
The present invention relates to techniques for use in a storage subsystem and an information processing system, and in particular to techniques for detecting and recovering from errors occurring in storage subsystems having two or more components linked together by a communication link with a loop topology such as the fibre channel loop.
Conventional high capacity storage subsystems can comprise two or more hard disk drives which are connected by a Fibre Channel (FC). In the connecting topology of the FC Loop (Fibre Channel Arbitrated Loop, or FC-AL), each drive and a controller which controls the drive in the storage subsystem are connected with one another by a loop topology. A port bypass circuit (PBC) is installed in a connecting part between each drive and the FC Loop in order to disconnect the drive from the FC Loop when the drive incurs a failure or is to be replaced by another drive.
The Fibre Channel, one of the super gigabit technologies, has been standardized under the name “ANSI NCITS T11” (formerly ANSI X3 T11).
While certain advantages are perceived, opportunities for further improvement exist. For example, according to conventional FC Loop technology, once the fibre channel loop is broken at any point, it becomes substantially impossible to communicate between a controller and each drive connected to the fibre channel loop.
What is needed are improved techniques for detecting and recovering from errors occurring in disk drive subsystems having a controller and drive units connected by a fibre channel loop.
According to the invention, techniques for detecting and recovering from errors occurring in disk drive subsystems having a controller and drive units connected by a fibre channel loop are provided. Specific embodiments can provide storage subsystems, methods and apparatus for use in information processing environments, for example. Embodiments can determine when each drive is disconnected from the loop in the external storage subsystem structured by using the FC Loop, and thereupon, the FC Loop can be controlled by bridging the communication path using the PBC so that the loop is not broken.
An object of the present invention is to provide a storage subsystem equipped with communicating means of loop topology, for minimizing the decrease in performance and/or reliability even if a failure occurs in the storage subsystem.
Another object of the present invention is to provide the storage subsystem equipped with the communicating means of loop topology, for determining the failing part and for recovering from the failure quickly, simply and precisely.
Another object of the present invention is to provide a storage subsystem equipped with multiple communicating means of loop topology, for recovering reliably from multiple failures that affect the multiple loops of the communicating means.
An object of the present invention is to provide the information processing system equipped with the communicating means of loop topology, for minimizing the decrease in the performance and/or reliability, even if any failure occurs in the information processing system.
Another object of the present invention is to provide the information processing system equipped with the communicating means of loop topology, for determining the failing part and for recovering from the failure in the processing system quickly, simply and precisely.
Another object of the present invention is to provide an information processing system equipped with multiple communicating means of loop topology, for recovering from multiple failures that affect the multiple loops of the communicating means.
In a representative embodiment according to the present invention, a storage subsystem is provided. The disk storage subsystem can include a plurality of storage drives, a plurality of controllers to control said storage drives, and a plurality of data communication loops to connect the storage drives and the controllers and to exchange information between the controllers and the storage drives, a first bypass mechanism that connects and disconnects at least one of each of the storage drives and each of the controllers individually to each of the communication loops, and a second bypass mechanism that bridges each of the communication loops at a specified location to selectively isolate a portion of the communication loop. Responsive to detecting a failure, at least one of the controllers commands at least one of the first and second bypass mechanisms to successively disconnect and re-connect each of the storage devices to each of the communication loops under control of the controller through the other of the communication loops, to locate a cause of the failure.
In another representative embodiment according to the present invention, an information processing system is provided. The information processing system can comprise a plurality of component units, each of which performs at least one of storing information and processing information, a data communication loop to connect the component units and to exchange information with each other within the component units, a first bypass mechanism to control the connection and disconnection of each of the component units individually to and from the communication loop, and a second bypass mechanism to bridge the communication loop at a specified location and to selectively isolate a part of the communication loop. Responsive to detecting a failure, at least one of the component units commands at least one of the first and second bypass mechanisms to successively disconnect and re-connect each of the component units to the data communication loop to locate a cause of the failure.
In a further representative embodiment according to the present invention, a storage subsystem is provided. The storage subsystem can comprise a plurality of storage devices, linked to a plurality of controllers to control the storage devices by a plurality of data communication loops. The communication loops connect the storage devices and the controllers to exchange information between the controllers and the storage devices. The storage subsystem can also comprise a first plurality of bypass switches. Each bypass switch is operable to connect an associated one of the storage devices, and each of the controllers, individually to each of the communication loops, and to disconnect the associated one of the storage devices and each of the controllers individually from each of the communication loops. A second plurality of bypass switches can also be part of the subsystem. Each switch can be operable to connect, in a first operating state, to a group of the plurality of storage devices and their respective associated bypass switches, for electrical signal communications with one or more of the plurality of controllers. In a second operating state, the second plurality of bypass switches provides for electrically isolating the group of storage devices and their respective associated bypass switches from communicating with at least one of the plurality of controllers, while maintaining other storage devices in the communication loop. Responsive to detecting a failure, at least one of the controllers commands at least one of the first and second plurality of bypass switches to disconnect and re-connect at least one of the storage devices to at least one of the communication loops under control of the controller through the other of the communication loops.
In a yet further representative embodiment according to the present invention, a method for detecting and recovering from errors occurring in a disk drive subsystem is provided. The disk subsystem can have a plurality of controllers that control a plurality of storage devices, the controllers and storage devices interconnected by a plurality of communication loops, including a first communication loop and a second communication loop. The method can include monitoring the communication loops for a presence of a failure. If a failure is detected, the method can disconnect successive disk storage units connected by the communication loops, beginning at a point farthest from one of the plurality of controllers, and determine whether the failure has been recovered from. If the failure has been recovered from, the method can determine an identity of a component being a probable cause of the failure based upon an identity of a switch that led to recovery. Finally, the method can also include indicating the identity of the component that suffered a failure.
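The method of this embodiment can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the claimed implementation: the names locate_failed_drive, loop_ok and make_simulated_loop are hypothetical, the loop is simulated rather than driven through real bypass switches, and the sketch assumes that drives stay bypassed until communication recovers.

```python
from typing import Callable, Optional, Sequence, Set


def make_simulated_loop(faulty: Set[int]) -> Callable[[Set[int]], bool]:
    """Hypothetical stand-in for a real loop: communication succeeds only
    when every faulty drive has been bypassed."""
    def loop_ok(bypassed: Set[int]) -> bool:
        return faulty <= bypassed
    return loop_ok


def locate_failed_drive(
    drive_ids: Sequence[int],
    loop_ok: Callable[[Set[int]], bool],
) -> Optional[int]:
    """Return the drive whose bypass restored communication, or None.

    drive_ids is assumed to be ordered from nearest to farthest from the
    controller, so the search starts at the far end of the loop.
    """
    bypassed: Set[int] = set()
    if loop_ok(bypassed):
        return None  # no failure present, nothing to locate
    for drive in reversed(drive_ids):
        bypassed.add(drive)  # operate this drive's bypass switch
        if loop_ok(bypassed):
            return drive  # the switch that led to recovery names the suspect
    return None  # failure persists with every drive bypassed
```

For example, calling locate_failed_drive([0, 1, 2, 3], make_simulated_loop({2})) bypasses drive 3, then drive 2, at which point the simulated loop recovers and drive 2 is reported as the probable cause.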
In specific embodiments, a storage subsystem having multiple drives and controllers that are connected with a communication loop topology, such as FC-AL, is provided. In addition, PBCs (first bypass mechanism) can be used to disconnect the drives and controllers from the loop. Further PBCs (second bypass mechanism) can be installed to bridge and divide the loop at any desired location within the loop.
By controlling these PBCs, the location of a failing part in the loop can be determined. In specific embodiments, the location of the failing part can be determined by varying the effective portion of the loop through control of the PBCs and repeatedly confirming whether communication is available over that effective portion. If any operable portion of the loop is detected, that operable portion continues to be used, and only the inoperable portion of the loop is switched to another loop; the decrease in performance is thereby kept to a minimum.
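The idea of varying the effective portion of the loop can be sketched as follows. This Python fragment is only an illustration under assumptions: the name plan_routes, the probe callback, the segment labels and the "primary"/"alternate" route labels are hypothetical, and the sketch assumes the loop can be bridged after any leading number of segments counted outward from the controller.

```python
from typing import Callable, Dict, Sequence


def plan_routes(
    segments: Sequence[str],
    probe: Callable[[int], bool],
) -> Dict[str, str]:
    """Decide which loop serves each segment.

    probe(n) is assumed to bridge the loop after the first n segments and
    report whether communication over that effective portion succeeds.
    """
    usable = 0
    for n in range(1, len(segments) + 1):
        if not probe(n):
            break  # extending the effective portion to n segments fails
        usable = n  # the first n segments are confirmed operable
    routes: Dict[str, str] = {}
    for index, name in enumerate(segments):
        # Keep using the operable portion; move only the rest elsewhere.
        routes[name] = "primary" if index < usable else "alternate"
    return routes
```

With three segments and a fault in the second one, a probe that succeeds only for n = 1 yields the first segment on the primary loop and the remaining two on the alternate loop, so only the inoperable portion is switched.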
In additional specific embodiments, instructions for controlling the PBC are not issued through the communication loop. Rather, a dedicated bus for controlling the PBC is provided. Therefore, any failing part can be isolated even if both of the duplicated loops are failing simultaneously. Communication is still available using the remaining operable portion of the loop.
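The benefit of a dedicated control bus can be illustrated with a small simulation. The Python sketch below is hypothetical throughout: the classes Loop and ControlBus and the function in_band_bypass are illustrative names, not elements of any embodiment, and a loop is modeled simply as operable once all of its faulty ports are bypassed.

```python
from typing import Dict, Iterable, Set


class Loop:
    """Simulated loop: it is up once every faulty port is bypassed."""

    def __init__(self, name: str, faulty_ports: Iterable[int]) -> None:
        self.name = name
        self.faulty: Set[int] = set(faulty_ports)
        self.bypassed: Set[int] = set()

    def is_up(self) -> bool:
        return self.faulty <= self.bypassed


def in_band_bypass(carrier: Loop, target: Loop, port: int) -> bool:
    """A bypass command carried over a loop arrives only if that loop is up."""
    if not carrier.is_up():
        return False
    target.bypassed.add(port)
    return True


class ControlBus:
    """Hypothetical dedicated bus: sets bypass state without using any loop."""

    def __init__(self, loops: Iterable[Loop]) -> None:
        self.loops: Dict[str, Loop] = {loop.name: loop for loop in loops}

    def set_bypass(self, loop_name: str, port: int) -> None:
        self.loops[loop_name].bypassed.add(port)
```

If both simulated loops have a faulty port, in_band_bypass returns False in either direction because neither loop can carry the command, whereas ControlBus.set_bypass can still isolate each faulty port and bring both loops back up.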
Numerous benefits are achieved by way of the present invention over conventional techniques. These and other benefits are described throughout the present specification. A further understanding of the nature and advantages of the invention herein may be realized by reference to the remaining portions of the specification and the attached drawings.