As technology improves and the number of transistors per chip continues to grow exponentially, microprocessors have more cores and integrate more functions into their uncore (i.e., the parts of the microprocessor that are not part of a core). In addition, with increasing processor speeds, processors today execute an ever-growing number of instructions. As a result, the potential for errors increases. Meanwhile, it remains desirable to provide the same CPU chip availability despite the potential for an increased error rate.
Instructions typically use resources, such as a physical register file (PRF), for performing the operations included in the instructions. Sometimes, however, a PRF becomes corrupted, which can result in an execution error. To account for such errors, conventional systems apply parity error detection to the PRF. In parity error detection, an additional parity bit is stored with the data written to a register. When the register is read, the parity (usually even or odd) is recomputed from the data and compared against the stored parity bit; a mismatch indicates that the integrity of the data has been compromised. If a parity error is detected, conventional systems will typically trigger a machine check error (MCE). Unfortunately, an MCE is usually a catastrophic failure that generally requires a restart of the processor and may also result in loss of data. Hence, MCEs are not the preferred solution and their occurrence should be minimized.
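The conventional parity scheme described above can be illustrated with a minimal software sketch. It is not a description of any particular hardware implementation: the names `ParityRegisterFile`, `ParityError`, `even_parity_bit`, and `flip_bit`, as well as the choice of even parity and the register width, are assumptions introduced only for illustration.

```python
class ParityError(Exception):
    """Hypothetical stand-in for the condition that would raise an MCE."""


def even_parity_bit(value: int, width: int = 64) -> int:
    """Return the bit that makes the total count of 1s (data + parity) even."""
    mask = (1 << width) - 1
    return bin(value & mask).count("1") & 1


class ParityRegisterFile:
    """Toy model of a register file that stores one parity bit per register."""

    def __init__(self, num_regs: int = 16, width: int = 64) -> None:
        self.width = width
        self.data = [0] * num_regs
        self.parity = [0] * num_regs

    def write(self, idx: int, value: int) -> None:
        # Truncate to the register width and store the parity bit with the data.
        value &= (1 << self.width) - 1
        self.data[idx] = value
        self.parity[idx] = even_parity_bit(value, self.width)

    def read(self, idx: int) -> int:
        # Recompute parity on read and compare it with the stored bit.
        value = self.data[idx]
        if even_parity_bit(value, self.width) != self.parity[idx]:
            raise ParityError(f"parity mismatch in register {idx}")
        return value

    def flip_bit(self, idx: int, bit: int) -> None:
        # Fault injection: corrupt the stored data without updating parity.
        self.data[idx] ^= 1 << bit


if __name__ == "__main__":
    prf = ParityRegisterFile(num_regs=4, width=8)
    prf.write(0, 0b1011)
    prf.flip_bit(0, 2)  # simulate a single-bit upset
    try:
        prf.read(0)
    except ParityError as err:
        print("detected:", err)
```

In this sketch a single flipped bit is caught on the next read, whereas two flipped bits in the same register cancel out and go unnoticed, which is a known limitation of single-bit parity.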
In addition, mission-critical (MC) systems, as well as the increasingly large high-performance computing (HPC) systems being built, with tens of thousands or hundreds of thousands of CPU sockets working on an application, require an uptime of weeks or longer in order to complete their tasks. The mean time between failures (MTBF) requirements for those systems are very challenging. Using techniques like chip lockstep or core lockstep for error detection is not only difficult to implement, but also almost doubles the CPU power consumption in the system.
As achieving the desired CPU availability starts with error detection, there is a need for efficient, low-cost error detection mechanisms that detect all error types (soft, hard, etc.) in the storage elements, as well as in the logic blocks, of the CPU chips. Accordingly, a mechanism is provided for detecting errors in CPU cores' integer execution units executing scalar integer operations, floating-point (FP) execution units executing scalar FP operations, SIMD operations, address generation units (AGUs), etc.