1. Field of the Invention
The invention relates generally to methods and apparatus for isolating and analyzing faults in link-connected systems such as, for example, data processing systems arranged as a distributed network of host processors, switches and control units coupled by a plurality of communication links. More particularly, the invention relates to methods and apparatus for isolating faults in such systems (or networks), utilizing fault reports generated from within the system itself. The reports are transmitted to a central location, preferably during a predetermined time period, and are used to create a single error message identifying the probable nature and location of the fault. A preferred embodiment of the invention does not require either the construction or maintenance of systemwide configuration tables, commonly used in performing fault location and analysis.
2. Description of the Related Art
Various techniques are known for isolating faults in distributed networks, such as data processing systems, where the components of the system are coupled by a plurality of communication links. For example, diagnostic software may be employed to perform specific tests which can aid an operator in determining the location of a fault. Such software typically produces an error log, often containing multiple entries relating to a single fault event. An operator is usually required to analyze logged data before a conclusion can be reached regarding fault location.
U.S. Pat. No. 4,633,467 teaches a specific example of how software may be used to isolate faults in a computer system. In particular, hardware units in the system generate error reports in response to detected error conditions. A report list may be generated from the individual reports utilizing, for example, software embodying the methods taught in the referenced patent. This software not only provides a history of faults, but ages them based on elapsed time compared to a most recent fault. A weighting process is employed to help isolate faulty units.
The methods taught in the U.S. Pat. No. 4,633,467 require configuration information to be maintained and retrieved in order to implicitly determine which units are in active communication paths. These units then become the candidates for the fault location.
The end result of the analysis process taught in the U.S. Pat. No. 4,633,467 is a list which may contain multiple entries resulting from a single fault. Thus, the list needs to be analyzed by the operator to finally isolate the fault. Additionally, no diagnosis is rendered regarding the probable cause of the fault.
As indicated hereinabove, a timer-based mechanism is used in the referenced fault analysis process; however, timing is only used as a basis to exclude certain reports.
U.S. Pat. Nos. 4,727,548 and 4,745,593 disclose fault isolation systems that are similar to the one described in the U.S. Pat. No. 4,633,467. All three of these patents utilize timeout schemes in some fashion.
According to the invention disclosed in the U.S. Pat. No. 4,727,548, timeouts are used to create an activity window within which to detect faults on a signal link. If a transition does not occur within the timeout window, a fault on the link is indicated.
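The activity-window idea can be illustrated with a minimal sketch. It is not the mechanism of the referenced patent; the class name `ActivityWindow`, the explicit timestamps and the one-second window are assumptions chosen only for illustration.

```python
class ActivityWindow:
    """Hypothetical watchdog: indicates a link fault when no signal
    transition has been observed within `timeout` seconds."""

    def __init__(self, timeout, start=0.0):
        self.timeout = timeout          # length of the activity window (assumed unit: seconds)
        self.last_transition = start    # time of the most recent observed transition

    def transition(self, now):
        # Record that a transition was seen on the link at time `now`.
        self.last_transition = now

    def fault_indicated(self, now):
        # A fault is indicated once the window has elapsed with no transition.
        return (now - self.last_transition) > self.timeout


window = ActivityWindow(timeout=1.0)
window.transition(0.5)
quiet_too_long = window.fault_indicated(1.6)   # more than 1.0 s since the last transition
```

Note that such a check reports only that a link went quiet; it does not by itself say which element of the link failed.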
According to the invention described in the U.S. Pat. No. 4,745,593, a test packet is sent through the nodes of a network and a timeout scheme is used to check for an anticipated response. An error is noted if the response fails to be observed.
The inventions taught in the patents referenced hereinbefore are all prone to generate multiple error reports for a single fault; none of the references automatically integrate records to avoid multiple error messages and produce a single error message for the operator. Additionally, all of the above schemes require some type of global configuration information (like a configuration table) to be maintained in order to identify the probable source of a fault.
Still other techniques for isolating faults are set forth in U.S. Pat. Nos. 4,554,661 and 4,570,261.
The system of U.S. Pat. No. 4,554,661 utilizes hardware as a status filter to look for changes in system error status. These changes indicate either that a fault was detected or that a fault was repaired. Faults can be recognized as being inside a component, outside the component, or not isolated.
As with the software-based approaches to fault location, the hardware-based scheme taught in the U.S. Pat. No. 4,554,661 requires systemwide configuration information to be generated and maintained. Furthermore, multiple errors resulting from a single fault can still be generated and additional testing or analysis is required in such cases to isolate the fault.
In the U.S. Pat. No. 4,570,261, a voting scheme is taught which may be used to perform fault isolation. The scheme is also timer based and, similar to the timer-based aging scheme referred to above, the votes are weighted before deciding upon a possible source of the error.
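A weighted vote of this general kind might look like the following sketch. The exponential decay by age, the `half_life` parameter and the function name `weighted_vote` are assumptions made for illustration; the referenced patent's actual weighting rule is not reproduced here.

```python
from collections import defaultdict


def weighted_vote(votes, now, half_life=10.0):
    """Hypothetical weighted vote over suspected units.

    votes: iterable of (suspect_unit, timestamp) pairs.
    Each vote is weighted by its age, so newer votes count more
    (assumed rule: weight halves every `half_life` time units).
    Returns the unit with the highest total weight, or None if no votes.
    """
    scores = defaultdict(float)
    for unit, timestamp in votes:
        age = max(0.0, now - timestamp)
        scores[unit] += 0.5 ** (age / half_life)
    return max(scores, key=scores.get) if scores else None


# Two old votes against unit "A" are outweighed by one fresh vote against "B".
suspect = weighted_vote([("A", 0.0), ("A", 0.0), ("B", 20.0)], now=20.0)
```

Even in this simplified form, the result is only a ranked suspicion; it carries no diagnosis of the cause of the fault.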
The scheme of U.S. Pat. No. 4,570,261 is useful in a distributed system; however, as with all the other patents cited hereinabove, configuration information, usually in the form of a configuration table, needs to be created and maintained. Multiple error reports for a single error event are also prone to be output to the operator when utilizing the teachings of the U.S. Pat. No. 4,570,261.
Furthermore, none of the techniques in the referenced patents performs an automatic synthesis of error reports in a distributed, link-connected system, to isolate and identify a single fault location, and at the same time provide a diagnosis of the cause of the fault.
It is desirable to diagnose the cause of a fault at the time a fault is located. This is particularly true when service personnel need to be dispatched (often to customer premises) to remedy a problem. Data pertaining to the probable cause of a fault, if obtained prior to dispatching service personnel, would aid in minimizing (or eliminating in part) the time and expense associated with (a) first visiting a site to determine the parts or equipment required to correct a problem, (b) returning to a central supply facility to get the parts or equipment, (c) returning to the equipment site, etc.
With the advent of optical transmission media, optoelectronic system components, etc., it is now possible to distribute the aforementioned networks over distances of up to several kilometers. Previously, when a system fault was detected there was little chance of dispatching service personnel to the wrong location since all the equipment in the system was typically separated by at most a few hundred feet and located in a common building. More recently, however, as equipment in a single network may be geographically dispersed, it is important that both fault location and analysis (relative to the cause of the fault) be performed with enough precision to send the service personnel to the right place, with the right equipment, to rectify a problem.
The ability to send service personnel to the right place with an advanced diagnosis of the cause of a fault becomes even more important when the components used in the system are subject to high failure rates.
Distributed networks of the type referred to hereinabove provide a context in which the present invention may be used to great advantage. Such networks are typified by the system described in copending patent application Ser. No. 07/429,267, filed Oct. 30, 1989. Application Ser. No. 07/429,267 describes a switch and its protocols for making connections between one input/output channel (of a CPU) and either another input/output channel or a peripheral device (via a peripheral device control unit (CU)), in a data processing system. Patent application Ser. No. 07/429,267 is hereby incorporated by reference.
The system described in the incorporated copending application uses switch units installed between the CPUs and the CUs to allow connectivity from a single CPU network connection to multiple CUs, and from a single CU network connection to multiple CPUs. The bidirectional connection between two units, including the transmission medium plus the transmitters, receivers and related electronics on both ends, is called a link. The transmitter, receiver and related electronics at one end of a link are together called a link attachment.
When a failure occurs on a link, symptoms occur at both ends of that link and may propagate through the switch units and appear at both ends of multiple links. The symptoms of a failure thus appear on both ends of the failing link and may also appear at the ends of non-failing links. This results in the error being detected at multiple locations. It would be desirable if these failure reports could be gathered into one place and analyzed in such a fashion as to determine which link is failing and what the probabilities are of the failure having occurred in the various elements of that link.
As indicated hereinabove, when prior art techniques are used, multiple reports from a failure result in multiple messages to operators indicating the failure, multiple failure records in multiple locations, and the possibility of multiple calls for service for the same failure. The analysis of this information and determination of what type of service should be rendered is a time consuming process.
Each switch and most CUs have multiple link attachments with paths to CPUs so that when a single path or link fails, operation and communication can continue. In most installations the CPUs communicate with each other or they may each communicate with a central location.
It would be desirable to take advantage of these multiple link attachments and the ability of CPUs to communicate with each other and/or with a central location, in networks such as the one described in the incorporated copending application, to assure that failure information as seen by units in the network can be collected not only over a primary link (which itself may be faulty) but over an alternative reporting link as well.
Additionally, it would be desirable if, in such a network, multiple failure reports generated for single failures could be collected for analysis in a central location, and if a method could be provided for determining which reports belong to a specific incident without the need for a knowledge of the complete configuration of the network.
In order to analyze the multiple failure reports that occur from a single incident, it must be determined which of the failure reports received at the central point are from a single incident. A knowledge of the configuration of all of the CPUs, CUs and switches could, as indicated hereinbefore, be kept in a table, but there are difficulties in constructing such a table and dynamically keeping it up to date.
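The grouping problem can be made concrete with a small sketch. The rule used here, treating every report that arrives within a fixed window of an incident's first report as part of that incident, is an assumption for illustration only and is not stated in this section as the inventive method; the function name `group_by_window`, the report tuple layout and the unit names are likewise hypothetical.

```python
def group_by_window(reports, window):
    """Hypothetical grouping of failure reports into incidents.

    reports: list of (timestamp, unit_id, symptom) tuples received at the
             central location.
    window:  length of the collection period, in the same time units.
    A report joins the current incident if it arrives within `window` of that
    incident's first report; otherwise it starts a new incident.
    No table of the network configuration is consulted.
    """
    incidents = []
    for report in sorted(reports, key=lambda r: r[0]):
        if incidents and report[0] - incidents[-1][0][0] <= window:
            incidents[-1].append(report)
        else:
            incidents.append([report])
    return incidents


reports = [
    (0.0, "CU1", "loss of signal"),
    (0.4, "SW1", "loss of signal"),
    (5.0, "CPU2", "bit errors"),
]
incidents = group_by_window(reports, window=1.0)
```

A purely time-based rule such as this one would merge unrelated failures that happen to coincide, which is one reason timing alone may not be enough to isolate a fault.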
Furthermore, it would be desirable to be able to isolate a fault to a particular one of the plurality of units (or a particular link) in a network in situations where simply determining the source of a set of reports may not be enough information to isolate a fault. For example, it would be desirable to be able to identify a unit that failed and is itself unable to issue an error report.
For all of the reasons stated hereinabove, it would be desirable to provide methods and apparatus which can perform fault isolation and analysis, and which feature the ability to (a) automatically generate fault location information and a diagnosis of the probable cause of the fault; (b) provide the aforesaid information without the need to create or maintain systemwide configuration information, e.g., a system configuration table; (c) provide a way to collect error reports and isolate a fault even if a primary reporting path in a distributed link-connected system is down; (d) provide the operator with a single error message corresponding to a single failure event even when multiple error reports associated with the event are generated; and (e) precisely isolate a fault to one of a multiplicity of units (and/or links) in a distributed link-coupled system.