1. Field of the Invention
This invention relates to a system for isolating faults in a network and, in particular, to a method and apparatus for isolating faults in a network having serially connected links.
2. Description of the Related Art
Input/output (I/O) systems for interconnecting processors and control units via serial fiber optic links have been previously described in U.S. Pat. No. 5,107,489 issued Apr. 21, 1992 for SWITCH AND ITS PROTOCOL FOR MAKING DYNAMIC CONNECTIONS, the copending application of C. J. Bailey et al., Ser. No. 07/444,190, filed Nov. 28, 1989, and U.S. Pat. No. 5,157,667 issued Oct. 20, 1992 to Carusone et al. for METHODS AND APPARATUS FOR PERFORMING FAULT ISOLATION AND FAILURE ANALYSIS IN LINK-CONNECTED SYSTEMS, all of which are assigned to the owner of this application. In the system described in these references, one or more crosspoint switches having serial fiber optic ports may be actuated to form either a static or a dynamic connection between pairs of ports to establish bidirectional communication between a processor and a control unit coupled to one or more peripheral devices such as a printer, a direct access storage device (DASD) (e.g., a magnetic or optical disk drive), or a magnetic tape drive.
In a typical installation, a first link may interconnect a processor and a switch, while a second link may interconnect the switch and a control unit to complete the connection between the processor and the control unit. Each link may be several kilometers in length.
In a system of the type described above, one failure mode of the transmitter at one end of a fiber optic link is a gradual decrease in the emitted signal prior to a sudden, complete loss of transmission. The bit error rate (BER) increases as the signal decreases. The processor-to-control unit interfaces of the system are designed to be tolerant of bit errors, but performance decreases as the error rate increases, and beyond a certain point the error rate will cause a noticeable degradation. The BER thresholding process is designed to catch these situations before there is a complete loss of transmission or an unacceptable performance degradation. When the specified BER threshold is reached, notification is made that service is required. The system can continue in degraded operation, and since the threshold is chosen at a point where there is only a small effect on system performance, maintenance can be deferred until a more convenient time.
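By way of illustration only, the thresholding process described above may be sketched as a counter that compares the number of errors detected in an observation window against a limit. The class name BerMonitor, the fixed-window scheme, and the numeric values are assumptions made for this sketch and are not taken from the referenced systems.

```python
from dataclasses import dataclass


@dataclass
class BerMonitor:
    """Hypothetical per-link error counter with a fixed observation window."""

    threshold_errors: int        # assumed: errors tolerated per window
    window_seconds: float        # assumed: length of the observation window
    window_start: float = 0.0
    error_count: int = 0

    def record_error(self, now: float) -> bool:
        """Record one detected error at time `now` (seconds).

        Returns True when the count in the current window exceeds the
        threshold, i.e. when a service notification would be raised.
        """
        if now - self.window_start >= self.window_seconds:
            # The previous window has expired; start counting afresh.
            self.window_start = now
            self.error_count = 0
        self.error_count += 1
        return self.error_count > self.threshold_errors


monitor = BerMonitor(threshold_errors=3, window_seconds=10.0)
results = [monitor.record_error(t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(results)
```

In this sketch the fourth error within the same window exceeds the limit of three, and the error recorded after the window expires starts a new count.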
Depending on the mode of operation of the switch (i.e., as a dynamic switch or a static switch), some or all of the bit errors occurring on the originating link may be propagated through the switch to the destination link. Static switches propagate all bit errors from one link to the other, while dynamic switches only propagate a portion of the errors.
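The effect of switch mode on the errors observed at the destination may be modeled roughly as follows. This is a simplified sketch: the function name and the dynamic_fraction parameter are hypothetical, and the portion of errors that a dynamic switch actually propagates is not specified here, so the value of 0.5 is a placeholder.

```python
def errors_observed_downstream(
    source_errors: int,
    local_errors: int,
    mode: str,
    dynamic_fraction: float = 0.5,  # placeholder; the actual portion is not specified
) -> int:
    """Count the errors a detector on the destination link would see.

    source_errors: errors originating on the source link
    local_errors:  errors originating on the destination link itself
    mode:          "static" (all source errors propagate) or
                   "dynamic" (only a portion propagates)
    """
    if mode == "static":
        propagated = source_errors
    elif mode == "dynamic":
        propagated = int(source_errors * dynamic_fraction)
    else:
        raise ValueError(f"unknown switch mode: {mode!r}")
    return propagated + local_errors


print(errors_observed_downstream(10, 2, "static"))
print(errors_observed_downstream(10, 2, "dynamic"))
```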
Some bit errors may occur randomly, not as a result of component deterioration, at a rate that does not appreciably degrade system performance. Other errors, however, do indicate a fault in the sense of component deterioration. Errors of the first type are handled by various error detecting and correcting schemes which are well known generally and are beyond the scope of this disclosure. Errors of the second type, which are occasioned by component deterioration and which appreciably degrade system performance (although perhaps not to the extent of rendering the system inoperable), must be addressed by servicing the deteriorating component.
It is highly desirable, in a system such as the one described above having serially connected links, to be able to isolate the link on which a fault originates. The difficulty of this seemingly simple objective becomes apparent, however, when one considers how one might detect faults in a particular link. As noted above, one method that might be employed is to define a threshold error rate and to declare a fault if errors are detected at a rate exceeding the threshold. In a system having serially connected links over which errors may be propagated, this method is insufficient in itself, since error reports may be generated from several such links simultaneously.
Furthermore, errors may be occurring on the source link at a rate just below the threshold, as a result of component deterioration. Since these errors are propagated to the destination link, they will be detected on that link also and, from the standpoint of the destination unit, are indistinguishable from errors arising on the destination link.
Under these conditions, a single error on the destination link, or a small number of randomly occurring errors on that link not due to component deterioration, will move the cumulative error rate detected on the destination link over the threshold. As a result, the threshold crossing will be mistakenly attributed to a fault on the destination link, even if, as in the above hypothetical, the fault is actually on the source link. What is desired, therefore, is a system of fault isolation that correctly attributes errors in such situations to a fault on the source link.
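The misattribution described above can be made concrete with a small numerical sketch. The threshold of 15 errors per observation period, the error counts, and the function name are all hypothetical values chosen for illustration, and a static connection (which propagates every source-link error) is assumed.

```python
THRESHOLD = 15  # hypothetical: errors per observation period


def naive_faulty_links(source_count: int, destination_count: int,
                       threshold: int = THRESHOLD) -> list[str]:
    """Flag each link whose own detected count exceeds the threshold.

    This is the simple per-link test, with no attempt to account for
    errors propagated from an upstream link.
    """
    flagged = []
    if source_count > threshold:
        flagged.append("source")
    if destination_count > threshold:
        flagged.append("destination")
    return flagged


# Deteriorating transmitter on the source link: just below the threshold.
source_origin_errors = 14
# A few random errors on an otherwise healthy destination link.
destination_origin_errors = 2

# With a static connection, every source-link error is also detected
# on the destination link.
source_count = source_origin_errors
destination_count = source_origin_errors + destination_origin_errors

print(naive_faulty_links(source_count, destination_count))
```

Under these assumed numbers only the destination link is flagged, although 14 of the 16 errors detected there originated on the source link.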