The present invention relates to a method and a system for checkpointing a multiple-processor data processing system in order to provide for error recovery.
To allow a high degree of instruction-level parallelism in modern processors, several instructions can be executed and finally retired in parallel. This is essential if complex instructions of a CISC processor are translated into several simpler RISC-like instructions and if the number of instructions executed per cycle (IPC) is to be high. Retiring these instructions means that the contents of the architected register array are updated with the results of the internal instructions and that the corresponding store data are written back into the cache/memory. In order to reflect the instruction sequence given by a program, retirement, i.e. completion of instructions, occurs in conceptual order. Thus the terms “younger” and “older” instructions denote instructions found later or earlier, respectively, in an instruction sequence. Checkpointing means that snapshots of the state of the architected registers and of the corresponding data stored in the data cache are taken at a certain frequency, i.e. at fixed time intervals. The highest resolution is obtained if a snapshot is taken every cycle.
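The interplay of in-order retirement and per-cycle checkpointing described above can be illustrated by a minimal sketch. It is a simplified software model and not a description of any actual hardware. The names `Instr`, `retire_cycle`, and the `width` parameter are hypothetical and chosen for illustration only, and store data are omitted for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    seq: int            # position in conceptual (program) order
    dest: str           # architected register written by this instruction
    value: int          # result to be written on retirement
    done: bool = False  # execution finished (may happen out of order)


def retire_cycle(window, regs, checkpoints, width=2):
    """Model one cycle: retire up to `width` finished instructions from the
    head of the window, oldest first, then snapshot the architected registers.

    Illustrative assumption: a younger instruction never retires before an
    older one, even if it finished executing earlier.
    """
    retired = []
    while window and window[0].done and len(retired) < width:
        instr = window.popleft()
        regs[instr.dest] = instr.value    # update architected state
        retired.append(instr.seq)
    checkpoints.append(dict(regs))        # one snapshot per cycle
    return retired


# Instruction 2 has finished, but the older instruction 1 has not.
window = deque([
    Instr(0, "r1", 10, done=True),
    Instr(1, "r2", 20, done=False),
    Instr(2, "r1", 30, done=True),
])
regs, checkpoints = {}, []

first = retire_cycle(window, regs, checkpoints)   # only instruction 0 retires
window[0].done = True                             # instruction 1 completes
second = retire_cycle(window, regs, checkpoints)  # instructions 1 and 2 retire
```

In this sketch the first cycle retires only the oldest instruction, because the next one in conceptual order has not finished; the second cycle retires two instructions in parallel. Each entry of `checkpoints` is a state to which the model could be rolled back.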
Such a prior art checkpointing method is disclosed in U.S. Pat. No. 5,418,916. A checkpoint retry facility utilizes a store buffer in order to establish a store queue during normal operation and to provide the data necessary for checkpoint retry during a retry operation. The data buffered therein also include the register data of the floating point registers, the general-purpose registers and the access registers, as well as the program status word.
This is basically done with the help of a plurality of store buffers associated with the L1-cache of each of the processing units. Each store buffer is used as an intermediate buffer that holds the storage data until such data can be released to other portions of the storage hierarchy, where other CPUs can then access them.
In order to control the release of storage data, two information bits are provided in the store queue design: the “end of instruction” (EOI) bit and the “checkpoint complete” (COMP) bit. The data in the store buffer are available only to the processor directly associated with it. Other processors cannot access these data until they are written to the L2-cache or the memory, which is public to all other processors. This prior art approach, however, has some weaknesses when more than one external (CISC) instruction is to be checkpointed per cycle: at most a single instruction can be checkpointed per cycle.
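The role of the store buffer and its two control bits may be illustrated by the following minimal sketch. It is a simplified software model under stated assumptions and does not reproduce the actual logic of the cited patent. It assumes that the EOI bit marks the last store of an external instruction, that the COMP bit is set for all stores of an instruction once its checkpoint is complete, and that only entries with COMP set may be released to the L2-cache. The names `StoreEntry`, `StoreBuffer`, and all method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    addr: int
    data: int
    eoi: bool = False   # assumed: last store of its external instruction
    comp: bool = False  # assumed: checkpoint of that instruction is complete


class StoreBuffer:
    """Per-processor intermediate buffer in front of a shared L2 (a dict)."""

    def __init__(self):
        self.entries = []  # oldest entry first

    def store(self, addr, data, eoi=False):
        self.entries.append(StoreEntry(addr, data, eoi))

    def local_load(self, addr, l2):
        # The owning processor sees its own youngest buffered store first.
        for entry in reversed(self.entries):
            if entry.addr == addr:
                return entry.data
        return l2.get(addr)

    def mark_checkpoint_complete(self):
        """Set COMP for the oldest not yet checkpointed instruction.

        Handles at most ONE instruction per call, mirroring the limitation
        of one checkpointed instruction per cycle noted in the text.
        """
        start = next((i for i, e in enumerate(self.entries) if not e.comp), None)
        if start is None:
            return False
        end = next((j for j in range(start, len(self.entries))
                    if self.entries[j].eoi), None)
        if end is None:
            return False  # instruction has not reached its last store yet
        for entry in self.entries[start:end + 1]:
            entry.comp = True
        return True

    def drain(self, l2):
        # Release checkpointed stores, oldest first, to the shared L2.
        while self.entries and self.entries[0].comp:
            entry = self.entries.pop(0)
            l2[entry.addr] = entry.data


buf, l2 = StoreBuffer(), {}
buf.store(0x10, 1)            # instruction A, first store
buf.store(0x14, 2, eoi=True)  # instruction A, last store
buf.store(0x10, 3, eoi=True)  # instruction B, single store
buf.mark_checkpoint_complete()
buf.drain(l2)                 # only instruction A becomes visible in l2
```

After the single call to `mark_checkpoint_complete`, only the stores of instruction A reach the shared `l2`; the store of instruction B remains private to the owning processor until a further call. This models why a second instruction cannot be checkpointed within the same cycle under these assumptions.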