At the heart of many computer systems is the microprocessor or central processing unit (CPU), referred to collectively as the “processor.” The processor performs most of the operations that allow application programs to function. The execution capabilities of the system are closely tied to the CPU: the faster the CPU can execute program instructions, the faster the system as a whole will execute.
Early processors executed instructions from relatively slow system memory, taking several clock cycles to execute a single instruction. They would read an instruction from memory, decode the instruction, perform the required activity, and write the result back to memory, and each of these steps would take one or more clock cycles to accomplish.
As applications demanded more power from processors, internal and external cache memories were added to processors. A cache memory (hereinafter “cache”) is a section of very fast memory located either within the processor or external to the processor and closely coupled to it. Blocks of instructions or data are copied from the relatively slower system memory (DRAM) to the faster cache, where they can be quickly accessed by the processor.
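The block-copy behavior described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, which the text does not specify, and the names `DirectMappedCache`, `BLOCK_SIZE`, and `NUM_LINES` are hypothetical and chosen only for illustration. System memory is modeled as a plain list of words.

```python
BLOCK_SIZE = 4  # words copied per block (assumed value)
NUM_LINES = 8   # number of cache lines (assumed value)


class DirectMappedCache:
    """Toy model of a cache in front of slower system memory."""

    def __init__(self, memory):
        self.memory = memory              # stands in for DRAM
        self.lines = [None] * NUM_LINES   # each entry: (tag, block of words)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        block_number = address // BLOCK_SIZE
        index = block_number % NUM_LINES  # which cache line the block maps to
        tag = block_number // NUM_LINES   # identifies which block occupies the line
        line = self.lines[index]
        if line is not None and line[0] == tag:
            self.hits += 1                # word is already in the cache
        else:
            self.misses += 1              # copy the whole block from memory
            start = block_number * BLOCK_SIZE
            line = (tag, self.memory[start:start + BLOCK_SIZE])
            self.lines[index] = line
        return line[1][address % BLOCK_SIZE]


if __name__ == "__main__":
    cache = DirectMappedCache(list(range(64)))
    for address in range(8):
        cache.read(address)
    print(cache.hits, cache.misses)
```

In this sketch, only the first access to each block goes to system memory. Later accesses to the same block are served from the cache, which is where the speed benefit comes from.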
Cache memories can develop persistent errors over time, which degrade the operability and functionality of their associated CPUs. In such cases, the failed or failing cache memory has been physically removed and replaced. Moreover, where the failing or failed cache memory is internal to the CPU, the entire CPU module or chip has been physically removed and replaced. This removal process is generally performed by field personnel and increases system downtime.
Some computer systems use multiple CPUs concurrently. If a CPU fails during operation, it can cause severe problems for the applications that are running at the time of failure. Accordingly, it is desirable to determine how healthy each CPU is in order to remove unhealthy CPUs before they fail.
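One simple way to picture such a health determination is sketched below. This is an illustrative assumption and not a method stated in the text: a CPU is treated as unhealthy once the same cache location on that CPU has reported errors a threshold number of times. The function name `unhealthy_cpus`, the log format, and the threshold value are all hypothetical.

```python
from collections import Counter

ERROR_THRESHOLD = 3  # repeated errors at one location before flagging (assumed value)


def unhealthy_cpus(error_log, threshold=ERROR_THRESHOLD):
    """Return the sorted IDs of CPUs that show a persistent cache error.

    error_log is a list of (cpu_id, cache_location) tuples, one per
    reported error (hypothetical format).
    """
    counts = Counter(error_log)  # errors per (cpu_id, cache_location) pair
    flagged = {cpu for (cpu, _location), n in counts.items() if n >= threshold}
    return sorted(flagged)


if __name__ == "__main__":
    log = [(0, 0x10), (1, 0x20), (0, 0x10), (2, 0x30), (0, 0x10), (1, 0x24)]
    print(unhealthy_cpus(log))
```

Counting per location rather than per CPU separates a persistent fault from scattered one-off errors. A real system would likely also consider the time window and the error type.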