The present invention relates to a system and methods for isolating faults in a network, particularly in a Fibre Channel arbitrated loop (FCAL) or other multidevice computer system in which one or more components may fail.
Locating faults in a network such as an FCAL is a challenging and time-consuming undertaking, in particular where the loop includes many devices, each of which may undergo intermittent failures. In systems currently in use, logs are kept of failed commands, data transfers, responses, etc., so that diagnostics may be performed in attempting to locate the sources of failures.
Such diagnostics typically involve attempts to replicate a given fault condition, often in a trial-and-error manner, removing and/or replacing components until a faulty component is identified. This is a nondeterministic approach, since intermittent faults by definition do not occur every time a given state of the network occurs (e.g. a given FCAL configuration with a given I/O command). Thus, an engineer may spend a considerable amount of time and resources fruitlessly attempting to isolate an FCAL error, and additionally may replace more components than necessary, i.e. may replace nonfailing components along with a failed component, due to insufficient knowledge about the location of a failure.
Thus, current fault isolation techniques can waste time, effort and equipment. In addition, the difficulties inherent in fault isolation using current techniques can lead to extended periods of down time for a system or subsystem, so that a local error can have a broad effect, reducing the productivity of all users on the loop or network.
Accordingly, a system is needed that can isolate errors in a network deterministically or, failing that, with fewer trial-and-error attempts to locate failing components than current methods require.