1. Technical Field
The present invention relates to a data processing system. In particular, the present invention relates to processor nodes in a data processing system. Still more particularly, the present invention relates to automatic recovery from a failed node concurrent maintenance operation in a data processing system.
2. Description of Related Art
Processor node “HotPlug” or concurrent maintenance is the ability to add or remove a processor node from a fully functional data processing system without disrupting the operating system or software that is running on other processor nodes of the data processing system. A processor node comprises one or more processors, memory, and input/output devices, all connected to each other via interconnect cable. In a processor architecture such as Power6, up to eight processor nodes may be added to the data processing system in one HotPlug. Thus, the ability to HotPlug a node allows a user to service or upgrade a system without the costly downtime caused by system shutdowns and restarts. The Power6 processor is a product available from International Business Machines Corporation.
Existing node HotPlug implementations follow three high-level steps. First, communication links between all nodes of the data processing system are temporarily disabled. Second, the old configuration settings are switched to new configuration settings if new processor nodes are added to the system or if processor nodes are removed from the system. Third, communication links are initialized to re-enable communication flow between all the nodes in the system. These three steps must be performed in a very short amount of time, since the software that is running in the system hangs if the communication paths between processor nodes are not available for transmission of data.
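The three steps above can be sketched as follows. This is a minimal illustrative model only, not actual system firmware: the `Fabric` class, the `hotplug` function, and all field names are hypothetical and are introduced here solely to show the ordering of the steps.

```python
from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Hypothetical model of the links connecting the processor nodes."""
    active_nodes: set
    links_enabled: bool = True
    log: list = field(default_factory=list)


def hotplug(fabric, nodes_to_add=(), nodes_to_remove=()):
    """Apply the three high-level HotPlug steps in order (illustrative only)."""
    # Step 1: temporarily disable communication links between all nodes.
    fabric.links_enabled = False
    fabric.log.append("links disabled")

    # Step 2: switch from the old configuration settings to the new ones.
    fabric.active_nodes = (fabric.active_nodes | set(nodes_to_add)) - set(nodes_to_remove)
    fabric.log.append("configuration switched")

    # Step 3: initialize the links to re-enable traffic between all nodes.
    # This sketch performs no link health check before traffic resumes.
    fabric.links_enabled = True
    fabric.log.append("links re-enabled")
    return fabric
```

In this sketch, software on the running nodes would be stalled for the whole interval in which `links_enabled` is false, which is why the sequence has to complete quickly.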
However, a problem lies in existing HotPlug implementations. If a communication link is faulty and traffic is nevertheless allowed to flow over it, data errors may occur. Data errors may result in a fatal error (known as a system or partition checkstop), which causes the loss of the processes currently running on the system and of their data. In addition, this fatal error may cause system downtime, since the whole system or the partition must be rebooted.
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for automatic recovery from a failed node HotPlug operation, such that if a communication error occurs between processor nodes, it will not result in system downtime.