In today's systems designed to operate with more than one CPU, a processor that detects an error attempts to correct the problem by a retry, such as by retrying the instruction in which the error occurred, or by re-executing the program in which the error occurred. Checkpoint retry recovery is available only if a program is designed to store checkpoint data at various times during its execution. Retry techniques are effective only for intermittent errors. A solid error in the hardware persists through all retry attempts, so retries are limited to a maximum number, after which a solid (uncorrectable) error is declared if the error remains. Detection of a solid error causes the CPU to generate a machine check (MC) interruption.
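The bounded-retry behavior described above can be illustrated with a minimal sketch. It is a software model only, not the hardware mechanism itself: the names `execute_with_retry`, `make_faulty`, and `MachineCheck`, and the limit of 3 retries, are assumptions made for illustration and do not come from the text.

```python
from typing import Callable

MAX_RETRIES = 3  # assumed limit; the text does not state a number


class MachineCheck(Exception):
    """Models the machine check (MC) interruption raised for a solid error."""


def execute_with_retry(operation: Callable[[], bool],
                       max_retries: int = MAX_RETRIES) -> int:
    """Run `operation`, retrying while it reports an error.

    `operation` returns True on success and False when an error is detected.
    Returns the number of retries that were needed. Raises MachineCheck when
    the error persists through every allowed retry (a solid error).
    """
    if operation():
        return 0
    for attempt in range(1, max_retries + 1):
        if operation():
            return attempt  # intermittent error: it cleared on a retry
    raise MachineCheck(f"error persisted through {max_retries} retries")


def make_faulty(failures: int) -> Callable[[], bool]:
    """Build a hypothetical operation that fails `failures` times, then succeeds."""
    remaining = failures

    def op() -> bool:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return False
        return True

    return op
```

In this model, an operation that fails twice and then succeeds is treated as an intermittent error, while one that keeps failing past the limit raises `MachineCheck`.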
The MC interruption signals the system control program and provides an MC new PSW (program status word), which addresses an entry instruction in a recovery manager program within the system control program. The system control program may then attempt to re-execute the interrupted instruction to see if the error condition goes away. If the error condition does not go away, the system control program declares an abnormal end (ABEND) for the task whose execution was terminated by the error condition in its processor. Depending on the type of recovery support built into the terminated program, it may or may not be able to recover. Often a program lacks the ability to recover when it is terminated at an unplanned point in its execution, even when it has not lost its input data. When input data is lost due to an unplanned stoppage before execution is complete, programs using real-time data (such as from a teller machine or a process control sensor) cannot recover their input data, and the attempted recovery therefore fails even when an intermittent hardware error is corrected.
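The decision path taken after a machine check can be summarized in a small sketch. This is a deliberate simplification, not the recovery manager's actual logic: the `Task` fields, the `Outcome` values, and the function `handle_machine_check` are hypothetical names, and a task's ability to recover is reduced here to two assumed flags.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RESUMED = "re-execution succeeded, task continues"
    ABEND_RECOVERED = "task ABENDed and recovered"
    ABEND_LOST = "task ABENDed and could not recover"


@dataclass
class Task:
    name: str
    has_recovery_support: bool  # assumed flag: program was built with recovery support
    input_is_recoverable: bool  # assumed flag: False for real-time input that is lost


def handle_machine_check(task: Task, reexecution_succeeded: bool) -> Outcome:
    """Model what happens to the interrupted task after an MC interruption."""
    if reexecution_succeeded:
        # The error condition went away when the instruction was re-executed.
        return Outcome.RESUMED
    # The error persisted, so the task is abnormally ended (ABEND).
    if task.has_recovery_support and task.input_is_recoverable:
        return Outcome.ABEND_RECOVERED
    return Outcome.ABEND_LOST
```

For example, a task reading a process control sensor would be modeled with `input_is_recoverable=False`, so it ends in `ABEND_LOST` even if it has recovery support.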
If re-execution of an instruction continues to fail, indicating that a solid hardware error exists, the normal CPU operation of executing dispatched tasks is ended by putting the CPU in a checkstopped state (stopping the CPU's internal cycle clocks). The operating system software may maintain a retry threshold after which the CPU is checkstopped.
A checkstopped CPU is marked as a failed CPU by the system control program, so that no further program tasks are dispatched on it.
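The checkstop threshold and the exclusion of a failed CPU from dispatching can be sketched as follows. The `Cpu` record, the threshold value of 2, and the helper names are assumptions chosen for illustration; the text says only that a threshold may be maintained and that a failed CPU receives no further tasks.

```python
from dataclasses import dataclass
from typing import List

RETRY_THRESHOLD = 2  # assumed value; the text does not specify one


@dataclass
class Cpu:
    cpu_id: int
    failed_reexecutions: int = 0
    checkstopped: bool = False   # internal cycle clocks stopped
    marked_failed: bool = False  # set by the system control program


def record_reexecution_failure(cpu: Cpu, threshold: int = RETRY_THRESHOLD) -> bool:
    """Count a failed re-execution and checkstop the CPU once the threshold is reached.

    Returns True if the CPU is checkstopped after this call.
    """
    cpu.failed_reexecutions += 1
    if cpu.failed_reexecutions >= threshold:
        cpu.checkstopped = True
        cpu.marked_failed = True  # control program marks the checkstopped CPU as failed
    return cpu.checkstopped


def dispatchable_cpus(cpus: List[Cpu]) -> List[Cpu]:
    """Return the CPUs that may still have program tasks dispatched on them."""
    return [cpu for cpu in cpus if not cpu.marked_failed]
```

With two CPUs, two recorded failures on the first leave only the second in the list returned by `dispatchable_cpus`.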