1. Field of the Invention
The present invention is generally directed toward fault detection of one or more devices. More specifically, the present invention relates to identifying faulty devices connected to a storage system communication loop such that the devices may be bypassed.
2. Discussion of Related Art
Many systems rely on a variety of devices in order to operate. For example, a storage system may include multiple storage devices for storing large amounts of data. In the storage system example, the storage devices and storage controllers are often interconnected through a Fibre Channel (FC) loop. The storage system may be communicatively connected to a host system, such that the host system sends requests to the storage devices through the FC loop. In an FC loop, all devices are interconnected in a “daisy-chained” fashion, with each device connected to the next in a continuous loop topology.
Occasionally, devices of such systems fail to operate according to specified standards of operation. Other devices fail completely and do not function at all; such failures are known as catastrophic failures. When a device is not fully operational or when the device fails completely, the device may impede the operability of the overall system. For example, a failed storage device, such as a computer disk drive, in the storage system may disrupt operations of the other storage devices in the storage system by impeding communications through the FC loop. Because every device on the loop passes traffic to the next, a single failed device connected to the FC loop may cause the FC loop to become completely non-functional.
When a device is failing and disrupting operations of the system, the device is typically replaced with another. Many systems are designed to allow for rapid replacement of failing devices. For example, many storage systems employ “hot swappable” computer disks that allow a user, such as a system administrator, to simply remove the failing computer disk and replace it with another computer disk. While the failing devices are at times relatively simple to replace, identification of the failing device is much more difficult.
In many environments, a system includes a large number of devices connected to the loop. Identification of a single failing device is at times daunting. For example, the storage system may employ hundreds of computer disks, all of which are operationally connected to the FC loop. In the storage system, if one computer disk fails to function, the entire FC loop becomes non-functional and, as such, so may the storage system. The failed or failing computer disk(s), therefore, must be identified rapidly so that the computer disk(s) may be replaced quickly and periods of inoperability of the storage system during such a replacement may be diminished. However, as presently practiced in the art, the failed or failing computer disk(s) are identified by a “trial and error” method.
Identifying a failed or failing device through trial and error is an arduous task, particularly so when the system includes many devices, such as the storage system with hundreds of computer disks. The trial and error method consists of removing and reengaging devices one by one until the loop becomes operational. While each device is temporarily removed, the storage system may be forced to run in a degraded mode of operation depending on the relevance of the removed device to the ongoing operation of the system. The entire process of removing each device until the failed or failing device is found, and of reengaging the incorrectly removed devices, creates long periods of “down time”. Many systems cannot afford such down time. For example, a traffic management computer system may employ hundreds of computers connected to a central processing system to observe and/or control the flow of many different types of traffic, such as land traffic and air traffic. The central processing system relies heavily on a storage system to maintain data on the traffic and cannot have any portion of the overall system down for any observable length of time. A failed storage system in the traffic management system could result in catastrophic collisions within the traffic.
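By way of illustration only, the trial and error method described above may be sketched as follows. The sketch is a simplified simulation and is not drawn from any particular storage system: the names make_loop_probe and find_faulty_by_trial_and_error, the device labels, and the model of the loop (operational only when every faulty device has been removed) are all assumptions introduced for this example.

```python
from typing import Callable, List, Optional, Set, Tuple


def make_loop_probe(faulty: Set[str]) -> Callable[[Set[str]], bool]:
    """Build a hypothetical loop check for a simulated set of faulty devices."""
    def loop_ok(removed: Set[str]) -> bool:
        # Assumed model: the loop works only if every faulty device is removed.
        return faulty <= removed
    return loop_ok


def find_faulty_by_trial_and_error(
    devices: List[str],
    loop_ok: Callable[[Set[str]], bool],
) -> Tuple[Optional[str], int]:
    """Remove one device at a time until the loop becomes operational.

    Returns the device whose removal restored the loop (or None) and the
    number of removals performed along the way.
    """
    if loop_ok(set()):
        # The loop is already operational; there is nothing to search for.
        return None, 0
    removals = 0
    for device in devices:
        removals += 1
        if loop_ok({device}):
            return device, removals
        # Otherwise the device is reengaged and the next one is tried.
    return None, removals


if __name__ == "__main__":
    disks = [f"disk{i}" for i in range(200)]
    probe = make_loop_probe({"disk149"})
    print(find_faulty_by_trial_and_error(disks, probe))
```

Under these assumptions the number of removals grows linearly with the number of devices on the loop, and the search identifies nothing at all when two or more devices are faulty at once, since no single removal restores the loop.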
As evident from the above discussion, a need exists for improved structures and methods for identifying faulty devices connected to a storage system communication loop.