1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient reliable execution on a simultaneous multithreading machine.
2. Description of the Relevant Art
Mission critical software applications require high reliability. Some examples of these applications include financial and banking software, databases, and military applications. Software testing methods may be used to verify and validate a software application to a predetermined level of quality. However, problems may arise due to the hardware platform used to execute the application, such as the microprocessor. Although a microprocessor may have been previously tested to meet predetermined quality requirements, as with the software application, testing under all combinations of inputs and preconditions, such as an initial state, is not feasible. In addition to functional errors, modern microprocessors may experience both hard errors, such as stuck-at faults, and soft errors, such as radiation induced errors on storage nodes.
With both the node capacitance and the supply voltage decreasing over time with the next generations of new processors, the amount of electrical charge stored on a node decreases. Due to this fact, nodes are more susceptible to radiation induced soft errors caused by high energy particles such as cosmic rays, alpha particles, and neutrons. This radiation creates minority carriers at the source and drain regions of transistors to be transported by the source and drain diodes. The change in charge stored on a node compared to the total charge, which is decreasing with each generation, may be a large enough percentage that it surpasses the circuit's noise margin and alters the stored state of the node. Although the circuit is not permanently damaged by this radiation, a logic failure may occur.
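The charge argument above may be illustrated with a short calculation. The following is a minimal sketch only: the capacitance, voltage, and injected-charge values are hypothetical figures chosen for illustration and are not taken from any particular process technology, and the function names are likewise assumptions.

```python
def stored_charge(capacitance_farads, supply_volts):
    """Charge held on a node, Q = C * V, in coulombs."""
    return capacitance_farads * supply_volts


def disturbed_fraction(injected_coulombs, capacitance_farads, supply_volts):
    """Fraction of the stored charge that a particle strike disturbs."""
    return injected_coulombs / stored_charge(capacitance_farads, supply_volts)


# Hypothetical values: the same 1 fC strike on an older and a newer node.
INJECTED = 1e-15  # coulombs, illustrative only
older = disturbed_fraction(INJECTED, capacitance_farads=4e-15, supply_volts=2.5)
newer = disturbed_fraction(INJECTED, capacitance_farads=1e-15, supply_volts=1.0)

# The same strike disturbs a much larger share of the charge on the newer node.
print(f"older node: {older:.0%} of stored charge disturbed")
print(f"newer node: {newer:.0%} of stored charge disturbed")
```

Under these assumed values, the identical strike that disturbs a small share of the charge on the older node disturbs the entire stored charge on the newer node, which is the trend the paragraph above describes.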
For the above reason, memories such as static random access memory (SRAM) use error correcting code (ECC) to detect and correct soft errors. Sequential elements, such as flip-flops, may use larger capacitance nodes or redundant latches within their design in order to combat soft errors. However, nodes within combinatorial logic, such as integer and floating-point functional units, are also susceptible to soft errors. Therefore, testing that guards against functional errors and hard errors has not proven that combinatorial logic is safe against soft errors, which may be unacceptable for mission critical applications.
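As an illustration of how an error correcting code may detect and correct a single flipped bit, the following minimal sketch uses the textbook Hamming(7,4) code. This particular code is chosen here only for brevity; it is an assumption for illustration and is not stated to be the code used by any particular SRAM, which may use a wider code with additional double-error detection.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


# Simulate a soft error: flip one stored bit, then read the word back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # a single-bit upset at codeword position 5
recovered, where = hamming74_correct(stored)
print(recovered, where)
```

A code of this kind protects stored bits only. It does not cover a value that is computed incorrectly by combinatorial logic before it is stored, which is the gap the paragraph above points out.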
Regardless of whether an error is due to failed functionality, a hard error, or a soft error, a mission critical application may have a low tolerance for the occurrence of any error and may not allow for repeat execution with a particular data set. In order to ensure correct operation of an application on particular hardware and to detect an error, two parallel executions of the application may be run with checkpoints. At each checkpoint, a comparison may be performed of the resulting data of each execution, which should be the same. Thus, the simultaneous executions are running in lockstep. Any difference detected by a comparison at a checkpoint may flag an error. Both executions may then roll back to the previous successful checkpoint, and the parallel executions may be re-run from that checkpoint. Also, a flag or warning may be reported to a user. A user may decide to re-run the executions to see if a difference is found again at the problematic checkpoint, or may decide to debug the application at the time a difference in resulting data is determined.
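The checkpoint, compare, and roll back sequence described above may be sketched in software as follows. This is a simplified model under stated assumptions, not an implementation of any particular system: the two executions are modeled as sequential calls rather than as parallel hardware, and the names run_redundant, step, checkpoint_every, and max_retries are hypothetical.

```python
import copy


def run_redundant(step, initial_state, num_steps, checkpoint_every, max_retries=3):
    """Run two copies of a computation, comparing their state at each checkpoint.

    step(state, copy_id, index) returns the next state. On a mismatch, both
    copies roll back to the last checkpoint at which they agreed and re-run.
    Returns (final_state, number_of_mismatches_detected).
    """
    checkpoint = copy.deepcopy(initial_state)
    checkpoint_index = 0
    mismatches = 0
    retries = 0
    while checkpoint_index < num_steps:
        end = min(checkpoint_index + checkpoint_every, num_steps)
        first = copy.deepcopy(checkpoint)
        second = copy.deepcopy(checkpoint)
        for index in range(checkpoint_index, end):
            first = step(first, 0, index)
            second = step(second, 1, index)
        if first == second:
            # Results agree: commit a new checkpoint and move on.
            checkpoint, checkpoint_index, retries = first, end, 0
        else:
            # Results differ: flag the error and re-run from the old checkpoint.
            mismatches += 1
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"persistent mismatch before step {end}")


    return checkpoint, mismatches


# Hypothetical workload: a running sum, with one transient fault injected
# into the second copy at step 5 (the fault does not recur on the re-run).
pending_faults = {(1, 5)}


def faulty_sum(state, copy_id, index):
    value = state + index
    if (copy_id, index) in pending_faults:
        pending_faults.discard((copy_id, index))
        value ^= 1  # model a single flipped result bit
    return value


result, detected = run_redundant(faulty_sum, 0, num_steps=10, checkpoint_every=4)
print(result, detected)
```

In this sketch a transient fault is detected at the next checkpoint and disappears on the re-run, while a fault that recurs on every attempt exhausts the retry budget and is reported, corresponding to the flag or warning mentioned above.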
However, it may be difficult to perform efficient parallel lockstep execution. For example, using two microprocessors, wherein each microprocessor executes a copy of the application simultaneously and begins execution at the same time as the other microprocessor, may not provide lockstep execution due to reasons such as unequal direct memory access (DMA) times and unequal refresh operations. Therefore, it may be more advantageous to use one microprocessor with copies of hardware and functional units.
Many modern microprocessors utilize copies of cores in order to implement multi-threading operation, wherein each core may independently operate on a separate software thread simultaneously with other cores.
One manner to achieve lockstep execution of a mission critical application is to execute the application and a copy of the application simultaneously on two copies of a core within a microprocessor. If each core receives the same instruction, such as the original instruction in a first core and a twin copy of the instruction in a second core, then a comparison of relevant data may be performed in each clock cycle. Therefore, lockstep execution of the mission critical application may be achieved.
However, different factors may interrupt this lockstep execution. For example, not all hardware resources may be copied in order to achieve multi-threading operation within a microprocessor. A floating-point unit (FPU) contains complex logic that consumes a significant amount of on-die real estate. Also, FPU operations are not performed often. Therefore, a designer is not motivated to create independent, expensive copies of floating-point logic on the die. Rather, multi-threading operation for an FPU and possibly other hardware resources on-die may be achieved by simultaneous multi-threading (SMT).
As with multi-threading, in SMT, instructions from more than one thread can be executing in any given pipeline stage at a time, which may be used to hide memory latency and increase throughput of computations per amount of hardware used. However, SMT works by duplicating certain sections of the processor, such as those that store the architectural state, but not duplicating the main execution resources. This allows an SMT equipped processor to appear as two “logical” processors to the host operating system. The operating system may schedule two or more threads or processes simultaneously. Where execution resources in a non-SMT capable processor would go unused by the current thread, especially when the processor is stalled due to a cache miss, a branch misprediction, or another event, an SMT equipped processor may use those execution resources to execute another scheduled thread.
The SMT hardware, such as an FPU, does not perform operations of two threads in lockstep. Therefore, any communication with hardware copies, such as two independent integer cluster copies, interrupts lockstep execution within the two integer clusters. Further, in order not to decrease performance of the microprocessor when it is not operating in a reliable execution mode, it is not desirable to modify any schedulers or renaming logic, or to route signals between the integer clusters in order to synchronize signals received out of lockstep from the FPU.
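The loss of lockstep at a shared unit may be illustrated with a small timing model. The following sketch rests on assumptions that are not taken from the description above: the shared FPU accepts one new operation per cycle, it always services thread 0 first when both threads request it in the same cycle, and it has a fixed latency of four cycles. The function names are hypothetical.

```python
def duplicated_unit_completion(issue_cycles, latency=4):
    """Each thread has its own unit, so both results return in the same cycle."""
    done = [cycle + latency for cycle in issue_cycles]
    return done, list(done)


def shared_unit_completion(issue_cycles, latency=4):
    """Both threads issue the same operation in the same cycle to ONE shared unit.

    issue_cycles must be in ascending order. The unit is assumed to be
    pipelined, accepting one operation per cycle, with thread 0 winning ties.
    Returns the completion cycles seen by thread 0 and by thread 1.
    """
    done = {0: [], 1: []}
    next_free_cycle = 0
    for cycle in issue_cycles:
        for thread_id in (0, 1):  # thread 0 is serviced first (assumption)
            start = max(cycle, next_free_cycle)
            next_free_cycle = start + 1
            done[thread_id].append(start + latency)
    return done[0], done[1]


issues = [0, 10, 11]
print(duplicated_unit_completion(issues))
print(shared_unit_completion(issues))
```

Under these assumptions the two threads receive each floating-point result in different cycles whenever the shared unit is used, so a cycle-by-cycle comparison of the two integer clusters would report a mismatch even though no error has occurred.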
In view of the above, efficient methods and mechanisms for reliable execution on a simultaneous multithreading machine are desired.