The present invention relates to computer systems.
There are many fields in which mankind has become reliant on computers to perform valuable and sometimes essential functions. This reliance demands that the down time of computer systems be as small as possible. The down time of a computer system is a period during which the computer system is inoperable as a result of a fault in the system. If a computer system goes down, the inconvenience, the loss of revenue and indeed the life-endangering effects can be substantial. As a result, computer systems are arranged to be as reliable as possible.
In a co-pending U.S. patent application, Ser. No. 09/097,485, a fault tolerant computer system is disclosed in which multiple processing sets operate to execute substantially the same software, thereby providing an amount of redundant processing. The redundancy provides a facility for detecting faults in the processing sets and for diagnosing and automatically recovering from the detected faults. As a result, the reliability of such computer systems is improved, and consequently the down time of such fault tolerant computer systems is likely to be substantially reduced.
Computer systems generally comprise a processor and memory connected via an I/O bus to utility devices which provide particular functions under control of the processor. Although redundant processing sets within a computer system provide a facility for detecting, diagnosing and recovering from errors in the processing sets, the utility devices within the computer system, including the connecting buses and peripheral buses, may fail from time to time. A device failure can disrupt the operation of the computer system, and may even cause the computer system to go down. Conventionally, detecting and identifying a faulty device has required the presence of a skilled technician.
It is therefore desirable to provide a computer system in which a faulty device or a replaceable unit containing the faulty device can be readily identified, so that repair can be effected quickly, and down time of the computer system can be reduced.