1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for recovery of a redundant node controller in a computer system.
2. Description of Related Art
The development of the Electronic Discrete Variable Automatic Computer (‘EDVAC’) of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices, far more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components: application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than those of just a few years ago.
The combination of hardware and software components in computer systems today has progressed to the point that computer systems can be highly reliable. Reliability in computer systems may be provided by using redundant components in the computer system. When one component fails, another component replaces it. In some computer systems, for example, components such as node controllers that manage hardware error requests in nodes of the computer system are provided in redundant pairs: one primary node controller and one redundant node controller. When such a primary node controller fails, the redundant node controller takes over the primary node controller's operations.
From time to time, a redundant node controller loses communication with other components in the computer system. Typical methods of recovery of redundant node controllers are reactive. That is, recovery of the redundant node controller is not attempted until the redundant node controller is called upon to replace the primary node controller. Recovery of the redundant node controller at this point is typically too late for reliable operation of the node controllers. Because the redundant node controller cannot communicate with other components in the computer system when called upon to replace the primary node controller, the redundant node controller cannot operate effectively as the primary node controller. Reactive recovery of redundant node controllers therefore reduces the reliability of node controllers in a computer system.
In other methods of recovery of redundant node controllers, both the redundant node controller and the component with which the redundant node controller lost communication must agree on the failure before recovery of the redundant node controller is attempted. Typically, however, one of the components is unaware of the loss of communication due to software errors. In such cases, recovery of the redundant node controller is not even attempted. Readers of skill in the art will recognize, therefore, that there exists room for improvement in recovery of a redundant node controller in a computer system.