1. Field of the Invention
The present invention relates to computer-based information storage systems. More particularly, the present invention relates to systems and methods for locating a device having a failed communication port in a multi-disk storage assembly, e.g., a RAID (Redundant Array of Independent Disks) array storage assembly.
2. Relevant Background
In the computer industry, there is ongoing and increasing demand for data storage systems with more capacity as well as improved reliability. The use of RAID systems has significantly enhanced data storage reliability by providing redundancy, i.e., failure of one system component does not cause loss of data or failure of the entire system. Although RAID systems initially provided redundancy mainly in the disk drives, more functional redundancy has recently been provided by extending redundancy to device enclosures. These enclosures may include a number of components such as power supplies, cooling modules, disk devices, temperature sensors, audible and/or visible alarms, and RAID and other controllers. To provide functional redundancy, the enclosure typically includes a spare of each component that is required for proper operation. For example, two power supply units may be provided such that if one fails, the remaining power supply unit is capable of providing adequate power.
Providing monitoring and control over the devices and enclosures within each cabinet in the storage system complex has proven to be a difficult problem for the data storage industry. Mass storage systems typically include numerous multi-shelf cabinets or racks, each holding multiple enclosures. The systems are adapted for replacement of individual enclosures to upgrade or modify the system or, in some cases, to service an enclosure, but a system for collecting status information and controlling operation of each device is required to manage the systems. Often, control devices such as array controllers are used to control the transfer of environmental data from the devices and to issue control commands to the devices, and a management tool such as a host computer with or without a graphical user interface (GUI) is provided to allow a system operator to manage device operations through the array controllers.
This arrangement has increased mass storage system complexity and cost by requiring a separate management tool or device for every array controller. Providing uniform control over the system devices is difficult with this common arrangement because accessing all the devices requires operating all of the management devices and/or communicating with all of the array controllers, even when the array controllers are physically located within the same cabinet. Additionally, it is difficult to allow sharing of resources between cabinets because each cabinet is typically serviced by different array controllers and/or management devices. Hence, there remains a need for an improved method and system for accessing information from and controlling operation of devices, such as enclosures and components within the enclosures, within a multi-cabinet mass storage system or complex.
In many mass storage systems, the data storage devices are connected to a host computer by a high-speed data communication link, e.g., a Fibre Channel Arbitrated Loop (FCAL), to provide a network of interconnected storage devices. Some storage network architectures use a communication protocol similar to a token ring, such that the failure of one storage device on the communication link may cause the entire communication link to fail. This can result in the catastrophic failure of large portions of storage networks.
The failed storage device must be repaired, replaced, or removed from the communication link to re-establish communication between the storage devices and the host computer. The location of the failed device must therefore be determined before the device can be repaired, replaced, or removed from the communication link. However, locating the failed device can be a time-consuming and expensive task. Large-scale storage systems may include thousands of storage devices, and each communication link in the storage system may have over one hundred devices connected to the link. Absent automated methods for locating failed devices, system administrators must independently test each device on the link to determine which device caused a failure. Manual testing processes can consume hours of administrative time, during which the data on the communication link may not be accessible to end users of the storage system. Accordingly, there remains a need in the art for improved storage systems and for methods for locating failed storage devices in mass storage systems.