It is not uncommon today for a computer system to be quite complex, often including multiple processors configured to provide parallel and/or distributed processing. For example, multi-processor computer systems often include not only multiple main processing units (MPUs), but may also include multiple support processors or agents, such as memory processors and the like. These various processors, as well as other system resources such as memory, input/output devices, disk devices, and the like may be distributed throughout the computer system with communication provided by various buses. For example, a computer system may comprise a number of sub-modules, referred to herein as cells or cell cards, having a number of system resources, such as main processing units (MPUs), agents, and/or memories, and buses disposed thereon. System resources of a sub-module may make and/or service requests to and/or from other system resources. Such system resources may be associated with the same sub-module and/or other sub-modules of the system.
To service requests from multiple system resources in an orderly and predictable manner, systems may implement various bus protocols and transaction queues. For example, a bus protocol may establish an order in which a plurality of transactions, e.g., requests, snoops, and responses, are to be performed, and perhaps a number of bus cycles allotted to each such transaction for completion. Similarly, transaction queues may store information about particular transactions that are “in-process” with respect to a particular system resource. An in-order queue, for example, may be implemented to ensure that particular transaction phases are performed in the proper order by an associated system resource. Accordingly, an in-order queue may track a number of transactions through their in-order phases, which might include a request phase, a snoop phase, and a response phase. Similarly, an out-of-order queue or transaction table, for example, may be implemented to track transactions whose results may be returned in any order.
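The distinction between the two kinds of queue can be illustrated with a minimal sketch. The class names, the method names, and the specific ordering rule (a transaction may complete a phase only after every older transaction has completed that phase) are assumptions made for illustration; they are not taken from any particular bus protocol.

```python
from collections import deque
from enum import IntEnum


class Phase(IntEnum):
    REQUEST = 0
    SNOOP = 1
    RESPONSE = 2
    DONE = 3


class InOrderQueue:
    """Hypothetical in-order queue: phases must complete in issue order."""

    def __init__(self):
        self._entries = deque()  # each entry is [txn_id, current Phase]

    def issue(self, txn_id):
        self._entries.append([txn_id, Phase.REQUEST])

    def advance(self, txn_id):
        """Complete the current phase of txn_id and return its new phase."""
        older_phases = []
        for entry in self._entries:
            if entry[0] == txn_id:
                # Assumed rule: every older transaction must already be
                # past the phase this transaction is about to complete.
                if any(p <= entry[1] for p in older_phases):
                    raise RuntimeError(f"{txn_id} would overtake an older transaction")
                entry[1] = Phase(entry[1] + 1)
                new_phase = entry[1]
                # Retire finished transactions from the head of the queue.
                while self._entries and self._entries[0][1] is Phase.DONE:
                    self._entries.popleft()
                return new_phase
            older_phases.append(entry[1])
        raise KeyError(txn_id)

    def pending(self):
        return [(txn_id, phase.name) for txn_id, phase in self._entries]


class TransactionTable:
    """Hypothetical out-of-order table: entries may complete in any order."""

    def __init__(self):
        self._pending = {}

    def issue(self, txn_id, info):
        self._pending[txn_id] = info

    def complete(self, txn_id):
        return self._pending.pop(txn_id)

    def pending(self):
        return set(self._pending)
```

In this sketch the in-order queue rejects any attempt by a younger transaction to complete a phase ahead of an older one, whereas the transaction table simply removes whichever entry is reported complete.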
If the system detects an error in the operation of any aspect of the system, such as with respect to any one of the aforementioned system resources, an error signal may be generated to notify the appropriate system resources. Such errors may be non-critical, such as those isolated to the operation of a single system resource and/or associated with a recoverable operation. However, such errors may instead be critical in nature, such as those requiring initialization of an entire bus (referred to herein as a bus initialization or BINIT error) and, therefore, of the system resources thereon.
A bus initialization error, or similar critical error, in a multi-processor system can lead to widespread failure, even system-wide failure, because the various system resources depend on one another to issue and/or respond to requests and responses. Although a single processor system may be able to recover from a bus initialization error by purging all pending transactions and fetching new instructions (e.g., “soft booting”), a bus initialization in a multi-processor system may result in a system “lock-up” requiring a “hard” reset, or may prevent the system from performing a core dump useful in isolating the source of the error. For example, when a bus is initialized, the system resources on that bus may cease to track their associated queues and thus cease to provide awaited responses to system resources not on the initialized bus. A system resource awaiting an anticipated response to a transaction being performed by a system resource on the initialized bus may therefore “hang.” Accordingly, a bus initialization error, or similar error, can result in a cascade failure in which the entire system deadlocks.
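The hang described above can be modeled with a small sketch. Everything here is hypothetical: the `Resource` fields, the function names, and the modeling of a bus initialization as simply discarding the queue state of every resource on the affected bus are simplifying assumptions, not a description of actual hardware behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical system resource attached to one named bus."""
    name: str
    bus: str
    awaiting: set = field(default_factory=set)     # txn ids whose responses are awaited
    to_service: dict = field(default_factory=dict)  # txn id -> requesting Resource


def send_request(txn_id, requester, responder):
    """Requester starts waiting; responder records the work it owes."""
    requester.awaiting.add(txn_id)
    responder.to_service[txn_id] = requester


def service_all(resource):
    """Normal operation: answer every tracked request."""
    for txn_id, requester in list(resource.to_service.items()):
        requester.awaiting.discard(txn_id)
    resource.to_service.clear()


def bus_init(bus_name, resources):
    """Assumed model of a BINIT: resources on the bus drop all queue state."""
    for resource in resources:
        if resource.bus == bus_name:
            resource.to_service.clear()
            resource.awaiting.clear()


def hung_resources(resources):
    """Names of resources awaiting a response that no resource still tracks."""
    tracked = {txn_id for r in resources for txn_id in r.to_service}
    return [r.name for r in resources if r.awaiting - tracked]
```

For instance, if a processor on one bus has a request outstanding to a memory agent on a second bus and the second bus is initialized, `hung_resources` reports the processor: its awaited response is no longer tracked anywhere, even though the processor itself was not on the initialized bus.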
It should be appreciated that the above described situation, in which a bus initialization error results in a system lock-up requiring a hard reset, is undesirable in a high availability (HA) system. Moreover, such a result does not provide for a system “crash-dump,” i.e., dumping all the memory to disk or other media to facilitate the operating system (OS) determining the cause of the error, but instead requires a system initialization which does not allow analysis of the state of the system at the time of the error.