1. Field of the Invention
This invention is related in general to the field of data storage systems. In particular, the invention consists of a system for isolating error conditions in a data communication fabric.
2. Description of the Prior Art
In FIG. 1, a computer storage system 10 includes host servers (“hosts”) 12, data processing servers 14, data storage devices 16 such as redundant arrays of inexpensive/independent disks (“RAIDs”), and a data communication system 18. Requests for information traditionally originate with the hosts 12, are transmitted by the communication system 18, and are processed by the data processing servers 14. The data processing servers 14 retrieve data from the data storage devices 16 and transmit the data back to the hosts 12 through the communication system 18. Similarly, the hosts 12 may write data to the data storage devices 16.
The communication system 18 may be a communication bus, a point-to-point network, or other communication scheme. FIG. 2 illustrates a communication fabric 20 including a symmetrical multi-processor (“SMP complex”) 22, a fabric controller 24, and a host adapter 26. The SMP complex 22 is a component of the data processing server 14 (FIG. 1) and the host adapter 26 is the interface for the host servers 12 (FIG. 1). Various error conditions may occur within any of these components. These error conditions may be critical, i.e., preventing the device from functioning, or may be transitory in nature. If a critical error occurs, the failed device must be re-initialized or replaced. However, transitory errors may be addressed according to the severity and frequency of the error.
Some errors result from faulty cables, power transients, or defective components. Some of these types of errors can be tolerated and accommodated by the communication fabric 20 as spurious events. However, a large number of non-critical errors may indicate impending component failure or that a component is in an unstable state requiring re-initialization. Counters may be used to track these non-critical errors. When a counter exceeds a pre-determined threshold, corrective action may be taken by resetting a device, quiescing a device so that it may be repaired, or fencing a device so as to prevent further errors.
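The counter-and-threshold scheme described above can be illustrated with a minimal sketch. The class name `ErrorTracker`, the `Action` values, the component name, and the threshold of 3 are hypothetical and chosen only for illustration; clearing the counter after corrective action is likewise an assumption, not something the prior art specifies.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    """Corrective actions named in the text (plus NONE for 'no action yet')."""
    NONE = "none"
    RESET = "reset"
    QUIESCE = "quiesce"
    FENCE = "fence"


class ErrorTracker:
    """Counts non-critical errors per component (hypothetical helper)."""

    def __init__(self, threshold, action=Action.RESET):
        self.threshold = threshold
        self.action = action
        self.counts = Counter()

    def record(self, component):
        """Record one non-critical error and return the action to take, if any."""
        self.counts[component] += 1
        if self.counts[component] > self.threshold:
            # Assumption: the count starts over once corrective action is taken.
            self.counts[component] = 0
            return self.action
        return Action.NONE


tracker = ErrorTracker(threshold=3, action=Action.FENCE)
results = [tracker.record("host_adapter") for _ in range(4)]
```

In this sketch the first three errors are tolerated as spurious events, and only the fourth, which pushes the count past the threshold, triggers the configured action.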
One problem is that a failure of any component of the communication fabric 20 may generate additional error conditions known as sympathy errors. These sympathy errors incorrectly increase the counts of the error counters. In order to accommodate this situation, the thresholds must be set higher than would otherwise be necessary in order to prevent premature resetting, quiescing, or fencing. This results in a system that is aware of an error condition and the most likely culpable component but has not experienced the error with enough frequency to overcome the artificially-high threshold. The problem is only compounded as the number of fabric components is increased. Accordingly, it is desirable to have a system for isolating and addressing error conditions. Additionally, it is desirable to resolve the error condition in the smallest possible amount of time.
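The effect of sympathy errors on such counters can be shown with a small simulation. The propagation map `SYMPATHY`, the component names, and the fault sequence are hypothetical; the sketch assumes only that each genuine fault in the fabric controller also registers one error on each neighboring component.

```python
from collections import Counter

# Hypothetical propagation map: a fault in the key component is also
# reported as a sympathy error by each listed neighbor.
SYMPATHY = {
    "fabric_controller": ["smp_complex", "host_adapter"],
    "smp_complex": [],
    "host_adapter": [],
}


def count_errors(faults):
    """Return naive per-component error counts, sympathy errors included."""
    counts = Counter()
    for culprit in faults:
        counts[culprit] += 1
        for neighbor in SYMPATHY[culprit]:
            counts[neighbor] += 1  # counted even though the neighbor is healthy
    return counts


def over_threshold(counts, threshold):
    """Return, in sorted order, the components whose count exceeds the threshold."""
    return sorted(name for name, n in counts.items() if n > threshold)


# Four genuine faults, all in the fabric controller.
counts = count_errors(["fabric_controller"] * 4)
flagged_low = over_threshold(counts, 3)   # all three components are flagged
flagged_high = over_threshold(counts, 4)  # nothing is flagged, culprit included
```

With the lower threshold the two healthy components are flagged alongside the faulty one; raising the threshold to avoid that also masks the genuinely faulty fabric controller, which is the trade-off the preceding paragraph describes.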
In U.S. Pat. No. 4,627,054, Cooper et al. describe an interconnect and isolation mechanism for multiple computer processing units (“CPUs”) joined on a processor bus. Cooper discloses isolating a failed CPU so that the rest of the system can continue operation. However, Cooper does not focus on detection of the failure or any failures that can be correlated back to the culpable component.
In U.S. Pat. No. 4,999,838, Horikawa discloses a system wherein a set of main processors has a peripheral processor and a means for returning the peripheral processor to an operational state after failure. However, Horikawa does not disclose a method of diagnosing error conditions to determine which peripheral processor is faulty and in need of service prior to complete failure.
In U.S. Pat. No. 5,237,677, Hirosawa et al. disclose using service processors to detect faults in remote processing units. Hirosawa describes storing the fault information and using that stored information to teach the system how to remedy the faults when later encountered. However, the system tries to generate standardized recovery processes based on current fault data and stored fault data. This requires that the error condition continue until either the faulty device fails or an error threshold is exceeded. Accordingly, it is desirable to have a system that forces the error to manifest itself so that it may be isolated.
In U.S. Pat. No. 6,182,248, Armstrong et al. describe an error injection circuit and methodology that generates faults on a bus by driving the logic high or low, simulating normal noise and error conditions, and monitoring the bus traffic (clocks, data signals, error signals). However, the communication fabric 20 of a computer storage system 10 is an extremely complex system requiring a specific and complex diagnostic schema. Accordingly, it is desirable to have a system for isolating errors in a complex system.