It is not uncommon today for a computer system to be quite complex, often including multiple processors configured to provide parallel and/or distributed processing. For example, multi-processor computer systems often include not only multiple main processing units (MPUs) but also multiple support processors or agents, such as memory processors and the like. These various processors, as well as other system resources such as memory, input/output devices, disk devices, and the like, may be distributed throughout the computer system with communication provided by various buses. For example, a computer system may comprise a number of sub-modules, referred to herein as cells or cell cards, having a number of system resources, such as MPUs, agents, and/or memories, and buses disposed thereon. System resources of a sub-module may make and/or service requests to and/or from other system resources. Such system resources may be associated with the same sub-module and/or other sub-modules of the system.
To service requests from multiple system resources in an orderly and predictable manner, systems may implement various bus protocols and transaction queues. For example, transaction queues may store information with respect to particular transactions “in-process” for a particular system resource. A processor, for example, may issue a large number of transactions, such as a number of memory reads, wherein a header and/or data return, such as a memory return, is expected in normal operation. If the processor issues a transaction and the transaction does not return, then the processor may experience an error condition, perhaps a critical one.
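By way of illustration only, the bookkeeping role of a transaction queue may be sketched as follows. This is a minimal sketch and not a description of any particular system: the class name `TransactionQueue`, its methods, and the integer transaction identifiers are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionQueue:
    """Tracks transactions that have been issued but have not yet returned.

    Hypothetical structure for illustration; real queues are hardware-specific.
    """
    in_process: dict = field(default_factory=dict)  # transaction id -> description
    next_id: int = 0

    def issue(self, description: str) -> int:
        """Record a newly issued transaction and return its (assumed) id."""
        txn_id = self.next_id
        self.next_id += 1
        self.in_process[txn_id] = description
        return txn_id

    def complete(self, txn_id: int) -> str:
        """Retire a transaction when its header and/or data return arrives."""
        return self.in_process.pop(txn_id)

    def outstanding(self) -> list:
        """Ids of transactions still awaiting a return, in issue order."""
        return sorted(self.in_process)


queue = TransactionQueue()
first = queue.issue("memory read A")
second = queue.issue("memory read B")
queue.complete(first)
# Only the second read is still awaiting its return.
print(queue.outstanding())
```

In such a sketch, an entry that is never retired by `complete` corresponds to a transaction that does not return, which is the situation that may lead to the error condition noted above.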
If an error in operation of any aspect of the system, such as with respect to any one of the aforementioned system resources, is detected by the system, an error signal may be generated to notify the appropriate system resources. Such errors may be non-critical, such as those isolated to the operation of a single system resource and/or associated with a recoverable operation. Other errors, however, may be critical in nature, such as those requiring initialization of an entire bus (referred to herein as a bus initialization or BINIT error) and, therefore, of the system resources thereon.
Generally, it is desirable to avoid particular error conditions, particularly critical error conditions wherein system resources are “hung” from further processing or which require initialization of a plurality of system resources. Therefore, certain predictable events, such as failure of particular transaction returns, may be dealt with in a manner calculated to minimize impact upon system operation, such as to avoid a critical error situation.
There may be a number of reasons for a failure to receive a transaction return, such as where software attempting to discover the hardware environment has made a call to hardware that has not been installed in the system or that has been removed. Accordingly, systems have implemented timing operations wherein the system will time out in a predictable and graceful way, such that the software will realize, for example, that a particular piece of hardware does not exist and processing will continue. In the past, time-out circuits were implemented specifically with respect to a particular apparatus and/or event. For example, some input/output (I/O) systems may have protocols under which, if a particular I/O card is not coupled to the system, a time-out counter associated with that particular I/O card will indicate a time-out period and the system will continue to process after a failed return from that I/O card. However, in a large system, implementing time-out counters with respect to each apparatus and/or event for which time-out processing may be desired can be prohibitively expensive, both in resources and processing overhead.
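The per-apparatus approach described above may be illustrated with a short software sketch. It is a simplified model under stated assumptions, not an actual time-out circuit: the names `TimeoutCounter` and `probe`, the tick-based clock, and the device names are hypothetical and chosen only for this example.

```python
class TimeoutCounter:
    """A countdown for one device; expires after `limit` ticks without a return.

    Hypothetical model of a dedicated time-out counter.
    """

    def __init__(self, limit: int):
        self.limit = limit
        self.remaining = None  # None means no request is pending

    def start(self) -> None:
        """Begin timing a newly issued request."""
        self.remaining = self.limit

    def cancel(self) -> None:
        """Stop timing because the return arrived."""
        self.remaining = None

    def tick(self) -> bool:
        """Advance one clock tick; return True exactly when the counter expires."""
        if self.remaining is None:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = None
            return True
        return False


def probe(counters: dict, installed: set, ticks: int) -> dict:
    """Issue one request per device and report which answered or timed out."""
    status = {}
    for name, counter in counters.items():
        counter.start()
        if name in installed:
            counter.cancel()  # assumed: installed hardware returns at once
            status[name] = "present"
        else:
            status[name] = "pending"
    for _ in range(ticks):
        # Every counter is visited on every tick, one counter per device.
        for name, counter in counters.items():
            if counter.tick():
                status[name] = "timed out"
    return status


counters = {"io_card_0": TimeoutCounter(3), "io_card_1": TimeoutCounter(3)}
print(probe(counters, installed={"io_card_0"}, ticks=3))
```

In this model the absent card is reported as timed out and processing continues, while the cost noted above is also visible: one counter must exist, and be advanced on every tick, for each apparatus being timed.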