1. Technical Field
This application relates to determining alternate paths in faulted systems.
2. Description of Related Art
Computers, computer networks, and other computer-based systems are becoming increasingly important as part of the infrastructure of everyday life. Networks are used for sharing peripherals and files. In such systems, complex components are the most common sources of failure or instability. The proliferation of multiple interacting components leads to problems that are difficult or impossible to predict or prevent. The problems are compounded by the use of networks, which introduce the added complexity of multiple machines interacting in obscure and unforeseen ways.
Additionally, the need for high performance, high capacity information technology systems is driven by several factors. In many industries, critical information technology applications require outstanding levels of service. At the same time, the world is experiencing an information explosion as more and more users demand timely access to a huge and steadily growing mass of data, including high quality multimedia content. Users also demand that information technology solutions protect data and perform under harsh conditions with minimal data loss and minimal data unavailability. Computing systems of all types are not only accommodating more data but are also becoming more and more interconnected, raising the amount of data exchanged at a geometric rate.
To address this demand, modern data storage systems (“storage systems”) are put to a variety of commercial uses. For example, they are coupled with host systems to store data for purposes of product development, and large storage systems are used by financial institutions to store critical data in large databases. For many of these uses, it is essential that the storage systems be highly reliable so that critical data is neither lost nor unavailable.
A typical data storage system stores and retrieves data for one or more external host devices. Such a data storage system typically includes processing circuitry and a set of disk drives (disk drives are also referred to herein as simply “disks” or “drives”). In general, the processing circuitry performs load and store operations on the set of disk drives on behalf of the host devices.
In certain data storage systems, the disk drives of the data storage system are distributed among one or more separate disk drive enclosures and processing circuitry serves as a front-end to the disk drive enclosures. The processing circuitry presents the disk drive enclosures to the host device as a single, logical storage location and allows the host device to access the disk drives such that the individual disk drives and disk drive enclosures are transparent to the host device.
In the aforementioned data storage system, the processing circuitry and the disk drive enclosures are typically interconnected in a serial manner using a number of cables to provide the front-end processing circuitry with access to any of the individual disk drives of the disk drive enclosures. For example, in the case where the data storage system includes multiple disk drive enclosures, a first cable electrically couples the processing circuitry to a first enclosure, a second cable electrically couples the first enclosure to a second enclosure, a third cable electrically couples the second enclosure to a third enclosure, and so on until each of the disk drive enclosures in the data storage system is serially coupled to the processing circuitry.
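The consequence of such serial cabling can be illustrated with a minimal sketch. It assumes a single chain in which cable k links the previous hop to enclosure k (cable 1 starts at the processing circuitry) and in which at most one cable has failed; the function name `reachable_enclosures` and its parameters are hypothetical and are used only for illustration.

```python
from typing import List, Optional


def reachable_enclosures(num_enclosures: int,
                         failed_cable: Optional[int] = None) -> List[int]:
    """Return the enclosures the processing circuitry can still reach.

    Assumption (illustrative only): enclosures are numbered 1..N along a
    single serial chain, and cable k couples hop k-1 to enclosure k, where
    hop 0 is the processing circuitry.
    """
    if failed_cable is None:
        # Every cable is intact, so the whole chain is reachable.
        return list(range(1, num_enclosures + 1))
    # A break at cable k cuts off enclosure k and everything behind it.
    return list(range(1, failed_cable))
```

In this model, a failure of the second cable in a three-enclosure chain leaves only the first enclosure reachable, which is why a single fault in a serial arrangement can affect many drives at once.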
For example, Fibre Channel (FC) is a high performance, serial interconnect standard for bi-directional, point-to-point communications between servers, storage systems, workstations, switches, and hubs. Fibre Channel standards are described by the Fibre Channel Industry Association (FCIA) (http://www.fibrechannel.org). Fibre Channel employs a topology known as a “fabric” to establish connections between nodes. A fabric is a network of switches for interconnecting a plurality of devices without restriction as to the manner in which the switches can be arranged. A fabric can include a mixture of point-to-point and arbitrated loop topologies.
Because of the high bandwidth and flexible connectivity provided by FC, FC is a common medium for interconnecting devices within multi-peripheral-device enclosures, such as redundant arrays of inexpensive disks (“RAIDs”), and for connecting multi-peripheral-device enclosures with one or more host computers. These multi-peripheral-device enclosures economically provide greatly increased storage capacities and built-in redundancy that facilitates the mirroring and failover strategies needed in high-availability systems. Although FC is well-suited for this application with regard to capacity and connectivity, FC is a serial communications medium, so malfunctioning peripheral devices and enclosures can, in certain cases, degrade or disable communications. FC-based multi-peripheral-device enclosures are therefore expected to isolate and recover from malfunctioning peripheral devices.
In particular, an FC interface that connects devices in a loop, such as a Fibre Channel Arbitrated Loop (FC-AL), is widely used in disk array apparatuses and the like, since it has a simple cabling configuration and can easily accommodate device extensions. In this type of interface, when signals cannot propagate around the loop because of a failure or the like in the interface circuit of a connected device (a condition called, for example, loop abnormality or link down), the whole loop cannot be used. That is, even though a failure occurs in only one device, none of the devices connected to the loop can be used. Thus, disk array apparatuses usually have interface circuits for two ports, so that the devices are connected to two independent loops. With this configuration, even when one loop of the dual loop interface is out of use because of a failure or the like, accesses can be performed using the other loop, thereby improving reliability.
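The dual loop arrangement described above can be sketched as follows. This is a simplified model, not an FC-AL implementation: the `Loop` class, the `failed_ports` field, and the `pick_loop` helper are hypothetical names, and the model assumes that any single failed interface circuit takes its entire loop down.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Loop:
    """One of two independent loops (hypothetical model for illustration)."""
    name: str
    devices: List[str]
    # Devices whose interface circuit on THIS loop has failed.
    failed_ports: Set[str] = field(default_factory=set)

    def is_up(self) -> bool:
        # Assumption: one failed interface circuit makes the whole loop unusable.
        return not self.failed_ports


def pick_loop(loops: List[Loop], target: str) -> Optional[str]:
    """Return the name of the first usable loop that reaches `target`."""
    for loop in loops:
        if target in loop.devices and loop.is_up():
            return loop.name
    return None  # No usable loop: the device cannot be accessed.


disks = ["disk0", "disk1", "disk2"]
loop_a = Loop("A", disks, failed_ports={"disk2"})  # disk2's port on loop A failed
loop_b = Loop("B", disks)                          # loop B is healthy
```

Here an access to `disk0` is routed over loop B even though the failure is in `disk2`, because loop A as a whole is unusable. If both loops were down, `pick_loop` would return `None`.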
In a data storage system, a bad component can disturb the loop in such a way that the loop “bounces,” causing software to re-initialize the loop repeatedly. This can cause input/output data transactions (I/Os) to be queued up, multiple drives to be removed, and input/output performance to be degraded, and can ultimately lead to a data unavailable/data loss (DU/DL) situation. Conventionally, whenever the loop is unstable, software removes the drives that are reporting errors, but the bad component may not be a drive. Because the bad component is not actually removed, more instability results, I/Os back up, and the situation can lead to DU/DL. Also, conventionally, because I/Os can back up before they are resumed, the situation can lead to performance degradation and storage processor (SP) crashes. Furthermore, conventionally, identification of the bad component can be difficult for the user, and multiple parts may end up being replaced.
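The shortcoming of the conventional approach can be illustrated with a toy model. It is a sketch under loudly stated assumptions and does not represent any actual storage system software: the function name `conventional_recovery` is hypothetical, the loop is assumed to stay unstable for as long as the bad component remains attached, and the drive that reports errors in each round is assumed, arbitrarily, to be the first drive still active.

```python
from typing import List, Tuple


def conventional_recovery(drives: List[str],
                          bad_component: str) -> Tuple[List[str], List[str]]:
    """Toy model: remove error-reporting drives until the loop is stable.

    Returns (removed_drives, remaining_drives).
    """
    active = list(drives)
    removed: List[str] = []
    # Assumption: the loop is unstable while the bad component is attached.
    # A non-drive component is never removed by this policy.
    while active and (bad_component not in drives or bad_component in active):
        # Assumption: the first active drive is the one reporting errors.
        victim = active.pop(0)
        removed.append(victim)
    return removed, active
```

With drives `["d0", "d1", "d2"]` and a bad component that is not a drive (for example, a hypothetical `"cable1"`), the policy removes every drive without restoring stability, which corresponds to the DU/DL outcome described above.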