The combination of hardware and software components in today's computer systems has progressed to the point that these systems can be highly reliable. Reliability in computer systems may be provided by using redundant components. In some computer systems, for example, components such as node controllers that manage hardware error requests at nodes of the computer system are provided in redundant pairs: one primary node controller and one redundant (backup) node controller. When such a primary node controller fails, the redundant node controller takes over the primary node controller's operations. Redundant pairs can also be used for system controllers for the same purpose. Node controllers and system controllers may also be referred to as service processors. A service processor is the component in a distributed computer system that performs operational tasks such as initialization, configuration, run-time error detection, diagnostics, and correction, and that closely monitors other hardware components for failures.
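The takeover behavior of a redundant pair described above can be illustrated with a minimal sketch. The names `Controller`, `RedundantPair`, `healthy`, and `active` are hypothetical and chosen only for illustration; an actual service processor would detect failures through its own monitoring mechanism rather than a simple flag.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    # Hypothetical model of a node or system controller.
    name: str
    healthy: bool = True


class RedundantPair:
    """A primary controller plus one backup that can take over its duties."""

    def __init__(self, primary: Controller, backup: Controller) -> None:
        self.primary = primary
        self.backup = backup

    def active(self) -> Controller:
        """Return the controller currently doing the work, failing over if needed."""
        if not self.primary.healthy:
            if not self.backup.healthy:
                raise RuntimeError("both controllers in the pair have failed")
            # The backup is promoted and becomes the new primary.
            self.primary, self.backup = self.backup, self.primary
        return self.primary


pair = RedundantPair(Controller("nc-a"), Controller("nc-b"))
first = pair.active().name     # primary is still healthy
pair.primary.healthy = False   # simulate a failure of the primary
second = pair.active().name    # the backup has taken over
print(first, second)
```

In this sketch the swap happens lazily, on the first request after the failure. That is a simplification made to keep the example short, not a claim about how real controllers schedule the takeover.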
A system dump is the recorded state of the working memory of a redundant node controller at a specific time, such as when a program running on the redundant node controller has detected a loss of communications with the system controller. First failure data capture (FFDC) is a minimum set of information related to a specific error detected by a node controller and/or system controller. Debug dump data is a superset of FFDC: it includes all information from the controller, including information that may not be directly relevant to the specific error under investigation. When an error occurs in one of the nodes, the dump of debug information is captured immediately from the primary node controller for further analysis. However, the backup node controller may become aware of the error only if the primary fails and the backup consequently takes over as primary. This process is called failover. Waiting for the failover process to complete before capturing the dump may delay the capture of the debug information and negatively impact the ability to analyze the error.
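The relationship between FFDC and debug dump data, and the delay introduced by waiting for failover, can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `ControllerState` class, the record layout (an optional error identifier paired with a payload), the error identifier `E42`, and the timestamps are all hypothetical and do not come from any real controller firmware.

```python
from dataclasses import dataclass, field


@dataclass
class ControllerState:
    # Hypothetical stand-in for a controller's working memory:
    # record name -> (related error id or None, payload)
    records: dict = field(default_factory=dict)

    def ffdc(self, error_id):
        """Minimum set: only the records tied to the given error."""
        return {k: v for k, v in self.records.items() if v[0] == error_id}

    def debug_dump(self):
        """Superset: every record, relevant to the error or not."""
        return dict(self.records)


def backup_dump_delay(error_time, failover_done_time):
    """Delay before the dump when capture waits for failover to finish."""
    return failover_done_time - error_time


state = ControllerState({
    "link_status": ("E42", "lost contact with system controller"),
    "fan_speed": (None, "4200 rpm"),
    "retry_count": ("E42", 3),
})
ffdc = state.ffdc("E42")
dump = state.debug_dump()
is_subset = ffdc.items() <= dump.items()  # FFDC is contained in the debug dump
delay = backup_dump_delay(error_time=10.0, failover_done_time=17.5)
print(sorted(ffdc), len(dump), is_subset, delay)
```

The sketch only models the timing cost as a subtraction. In practice the concern is that controller state may change during that interval, so a later dump may no longer reflect conditions at the time of the error.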