Typically, a distributed computer system includes a number of processors coupled to one another by an interconnection network. One of the processors has the task of monitoring for device failures within the computer system. For example, a heartbeat-type protocol is used to periodically poll each of the devices in the system to determine whether it is still active. If a once-active device is no longer active, the processor probes the device to find out whether an error has occurred. The time required to poll all of the devices grows in proportion to the size of the system.
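By way of illustration only, one round of such a heartbeat-type poll might be sketched as follows. The function name poll_devices, the is_alive callback, and the previously_active table are hypothetical names chosen for this sketch and are not taken from any particular system; the sketch assumes that each device can be queried with a single liveness check.

```python
from typing import Callable, Dict, List


def poll_devices(
    devices: List[str],
    is_alive: Callable[[str], bool],
    previously_active: Dict[str, bool],
) -> List[str]:
    """Run one heartbeat round and return devices that were active but no longer are.

    `is_alive` is a hypothetical liveness check supplied by the caller.
    `previously_active` is updated in place with the latest observed state.
    """
    newly_inactive = []
    for device in devices:  # one check per device, so cost is linear in system size
        alive = is_alive(device)
        if previously_active.get(device, False) and not alive:
            # A once-active device stopped responding; the monitoring
            # processor would next probe it for fault information.
            newly_inactive.append(device)
        previously_active[device] = alive
    return newly_inactive
```

Because the loop visits every device once per round, the duration of a round increases with the number of devices, which is the scaling cost noted above.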
When a failure is detected, the processor needs to communicate with the failed device to determine the cause of the failure, as well as to initiate the appropriate recovery scheme. For example, if a failure occurs within the interconnection network, then the processor needs to communicate with the network to retrieve fault information captured by the interconnection network and to initiate appropriate recovery. However, since there is no guarantee that a direct connection exists between the interconnection network and the processor, alternate mechanisms are generally used for this communication.
The use of a processor to search for and retrieve fault information in such a manner, and the further use of alternate mechanisms to retrieve the fault information when the error occurs in an interconnection network, are less efficient than desired. Thus, a need exists for a more efficient way of reporting errors to a processor for servicing. In particular, a need exists for a mechanism in which the reporting is performed by, for instance, the interconnection network itself, instead of having the processor search for and retrieve the fault information.