As the components of SMP computer systems become denser, the number of ways in which these computers can experience hardware faults or errors increases. To avoid system outages, it is vital that these computers have recovery circuitry to tolerate such errors.
For each error event, some data is often logged with the error to isolate the failing components so that the parts can be replaced. This debug data may also be used to isolate the failure or defect down to the suspected circuit or root cause.
However, it is not always feasible to build hardware that non-disruptively logs all the relevant failure information for every possible type of failure in a system. For instance, if an interface with ECC protection experiences correctable errors, it is often necessary to identify the failing parts. In general, only bus isolation (not bit isolation) is needed to determine which part to replace.
Hardware is often used to trap error information into registers for later debug. However, getting this data out of the machine non-disruptively (i.e., while the machine continues to run) can require complicated or expensive hardware, as well as simulation effort to verify that the logging hardware works. By contrast, logging data in a disruptive manner is often very simple (e.g., via LSSD Scan). Also, it is not always clear in advance which data will prove helpful for debug and which will not.
Rather than designing more hardware to log all the ancillary data, this invention defers the logging of that data until a disruption occurs. Examples of intentional disruptions are manual power-down, activation power-down, and restart. An example of an unintentional disruption is a system checkstop.
Unfortunately, there is often little control over when an operator decides to disrupt a machine, and valuable debug data can be lost as a result. Once any of these events occurs, the debug data is lost due to power loss or to new data being scanned into the hardware.
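As a rough illustration of the deferred logging described above, the following Python sketch models trap registers that hold error data while the machine runs and are only read out when a disruption occurs. It is a software analogy under stated assumptions, not the hardware mechanism itself: the names `TrapRegister`, `Machine`, `record_error`, and `disrupt` are hypothetical, the first-error-holds behavior of the trap is an assumption, and the in-memory `log` list merely stands in for whatever persistent log the real system would use.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Disruption(Enum):
    # Disruption types named in the text (illustrative subset).
    MANUAL_POWER_DOWN = "manual power-down"
    RESTART = "restart"
    CHECKSTOP = "system checkstop"


@dataclass
class TrapRegister:
    """Hypothetical trap register that latches error data for later debug."""
    name: str
    value: int = 0
    valid: bool = False

    def trap(self, data: int) -> None:
        # Assumption: the first trapped value is held; later errors do not overwrite it.
        if not self.valid:
            self.value = data
            self.valid = True


@dataclass
class Machine:
    traps: Dict[str, TrapRegister] = field(default_factory=dict)
    log: List[dict] = field(default_factory=list)  # stand-in for a persistent log

    def record_error(self, register: str, data: int) -> None:
        # Non-disruptive path: only latch the data; nothing is logged yet.
        self.traps.setdefault(register, TrapRegister(register)).trap(data)

    def disrupt(self, cause: Disruption) -> None:
        # Disruptive path: read out every valid trap before its state is destroyed.
        for reg in self.traps.values():
            if reg.valid:
                self.log.append(
                    {"cause": cause.value, "register": reg.name, "value": reg.value}
                )
        # Model the loss of hardware state (power loss or new data scanned in).
        self.traps.clear()
```

In this sketch, calling `record_error` any number of times leaves `log` empty, and a later `disrupt(Disruption.RESTART)` appends one entry per valid trap register before clearing the traps. The point of the design is that the running-machine path stays trivial, while all the readout work is concentrated in the disruption path, where simple disruptive access is already available.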