Not applicable.
1. Field of the Invention
The present invention relates generally to microprocessors. More particularly, the present invention relates to a pipelined, simultaneously and redundantly threaded processor adapted to execute the same instruction set in at least two separate threads for transient fault detection purposes. More particularly still, the invention relates to detecting transient faults between the multiple processor threads by comparison of their uncached load requests, and a data value replication system for ensuring each thread receives the same uncached load data value.
2. Background of the Invention
Solid state electronics, such as microprocessors, are susceptible to transient hardware faults. For example, cosmic radiation can alter the voltage levels that represent data values in microprocessors, which typically include tens or hundreds of thousands of transistors. The changed voltage levels change the state of individual transistors, causing faulty operation. Faults caused by cosmic radiation typically are temporary, and the transistors eventually operate normally again. The frequency of such transient faults is relatively low, typically less than one fault per year per thousand computers. Because of this relatively low failure rate, making computers fault tolerant currently is attractive more for mission-critical applications, such as online transaction processing and the space program, than for computers used by average consumers. However, future microprocessors will be more prone to transient faults due to their smaller anticipated size, reduced voltage levels, higher transistor count, and reduced noise margins. Accordingly, even low-end personal computers will benefit from being able to protect against such faults.
One way to protect solid state electronics from faults resulting from cosmic radiation is to surround the potentially affected electronics with a sufficient amount of concrete. It has been calculated that the energy flux of the cosmic radiation can be reduced to acceptable levels with at least six feet of concrete surrounding the chips to be protected. For obvious reasons, protecting electronics from faults caused by cosmic radiation with six feet of concrete usually is not feasible, as computers are usually placed in buildings that have already been constructed without this amount of concrete. Because of the relatively low occurrence rate, other techniques for protecting microprocessors from faults created by cosmic radiation have been suggested or implemented that merely check for and correct the transient failures when they occur.
Rather than attempting to create an impenetrable barrier that cosmic rays cannot pierce, it is generally more economically feasible and otherwise more desirable to provide the affected electronics with a way to detect and recover from faults caused by cosmic radiation. In this manner, a cosmic ray may still impact the device and cause a fault, but the device or system in which the device resides can detect and recover from the fault. This disclosure focuses on enabling microprocessors (referred to throughout this disclosure simply as “processors”) to recover from a fault condition.
One technique for detecting transient faults is implemented in the Compaq Himalaya system. This technique includes two identical “lockstepped” microprocessors that have their clock cycles synchronized, and both processors are provided with identical inputs (i.e., the same instructions to execute, the same data, etc.). In the Compaq Himalaya system, each input to the processors, and each output from the processors, is verified and checked for any indication of a transient fault. That is, the hardware of the Himalaya system verifies all signals going to and leaving the Himalaya processors at the hardware signal level: the voltage levels on each conductor of each bus are compared. The hardware performing these checks and verifications is not concerned with the particular type of instruction it is comparing; rather, it is only concerned that two digital signals match. Thus, there is significant hardware and spatial overhead associated with performing transient fault detection by lockstepping duplicate processors in this manner.
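By way of illustration only, the cycle-by-cycle comparison described above may be modeled in software as follows. This is a minimal sketch and not the Himalaya hardware: the function name `first_lockstep_mismatch` and the representation of each bus as a sequence of integer words, one per clock cycle, are assumptions made for the example.

```python
from typing import Iterable, Optional


def first_lockstep_mismatch(bus_a: Iterable[int],
                            bus_b: Iterable[int]) -> Optional[int]:
    """Return the first clock cycle at which two bus traces differ.

    Each trace holds one bus word per cycle (an assumed model). The
    comparison ignores what the words mean; it only asks whether the two
    signals match. Traces are compared up to the shorter length.
    """
    for cycle, (word_a, word_b) in enumerate(zip(bus_a, bus_b)):
        if word_a != word_b:
            return cycle  # a transient fault is indicated at this cycle
    return None  # the traces agree on every compared cycle
```

Because the check is applied to every word on every cycle, its cost grows with the number of conductors and the clock rate, which is the overhead noted above.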
The latest generation of high-speed processors achieves some of its processing speed advantage through the use of a “pipeline.” A “pipelined” processor includes a series of units (e.g., fetch unit, decode unit, execution units, etc.), arranged so that several units can simultaneously process an appropriate part of several instructions. Thus, while one instruction is decoded, an earlier fetched instruction is executed. These instructions may come from one or more threads. Thus, a “simultaneous multithreaded” (“SMT”) processor permits instructions from two or more different program threads (e.g., applications) to be processed simultaneously. However, it is possible to cycle lockstep the threads of an SMT processor to achieve fault tolerance.
An SMT processor can be modified so that the same program is simultaneously executed in two separate threads to provide fault tolerance within a single processor. Such a processor is called a simultaneous and redundantly threaded (“SRT”) processor. Some of the modifications to turn a lockstep SMT processor into an SRT processor are described in Provisional Application Ser. No. 60/198,530. However, utilizing known transient fault detection requires that each thread of the SRT processor be lockstepped (as opposed to having two SRT processors lockstepped to each other). Hardware within the processor itself (in the Himalaya, the hardware is external to each processor) must verify the digital signals on each conductor of each bus. While increasing processor performance and yet still doing transient fault protection in this manner may have advantages over previous fault detecting systems, SRT processor performance can be enhanced.
One such performance enhancing technique is to allow each processor thread to run independently. More particularly, one thread is allowed to execute program instructions ahead of the second thread. In this way, memory fetches and branch speculations resolve ahead of time for the trailing thread. However, verifying, at the signal level, each input and output of each thread becomes complicated when the threads are not lockstepped (that is, not executing the same instruction at the same time).
A second performance enhancing technique for pipelined computers is an “out-of-order” processor. In an out-of-order processor, each thread need not execute the program in the order it is presented; rather, each thread may execute program steps out of sequence. The technique of fault tolerance by verifying bus voltage patterns between the two threads becomes increasingly difficult when each thread is capable of out-of-order processing. The problem is further exacerbated if one processor thread leads in overall processing location within the executed program. In this situation, not only would the leading thread be ahead, but this thread may also execute the instructions encountered in a different sequence than the trailing thread.
The final performance enhancing technique of an SRT processor is speculative branch execution. In speculative branch execution, a processor effectively guesses the outcome of a branch in the program thread and executes subsequent steps based on that speculation. If the speculation was correct, the processor saves significant time (for example, over stalling until the branch decision is resolved). In the case of an SRT processor, it is possible that each thread speculates differently from the other on a given branch. Thus, it is impossible to do transient fault protection using known techniques, that is, by verifying digital signals on each bus, because there may be no corresponding signal between the two threads.
What is needed is an SRT processor that can achieve performance gains over an SRT processor in which each thread is lockstepped by using the performance enhancing techniques noted above, and that can also do transient fault detection.
The problems noted above are solved in large part by a simultaneous and redundantly threaded processor that has performance gains over an SRT processor with lockstepped threads and provides transient fault tolerance. The processor checks for transient faults by checking only memory requests (input/output (“I/O”) commands, I/O requests) that directly or indirectly affect data values in system memory. More particularly, the preferred embodiments verify only writes (stores) that change data outside the bounds of the processor and uncached reads, e.g., a read from a virtual address space mapped to an I/O device. Because this transient fault detection does not need to verify every input and output at the signal level, the transient fault protection extends to the threaded “out-of-order” processors, processors whose threads perform independent speculative branch execution, and processors with leading and lagging thread execution.
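By way of illustration only, the selection of which memory requests are verified may be sketched as a simple predicate. The names `MemOp`, `Kind`, and `needs_verification`, and the two boolean flags, are hypothetical and introduced solely for this example; the disclosure does not prescribe any particular representation.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    LOAD = "load"
    STORE = "store"


@dataclass(frozen=True)
class MemOp:
    kind: Kind
    address: int
    uncached: bool = False          # assumed flag: e.g., address maps to an I/O device
    leaves_processor: bool = False  # assumed flag: store changes data outside the processor


def needs_verification(op: MemOp) -> bool:
    """True only for requests the two threads must agree on before they proceed."""
    if op.kind is Kind.STORE:
        return op.leaves_processor
    return op.uncached  # loads are checked only when they are uncached
```

Under this sketch, an ordinary cached load is never compared between the threads, which is why the threads remain free to execute out of order, speculate independently, and run at different points in the program.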
An embodiment of the invention comprises a read queue and a compare circuit. The processor thread executing the program ahead, the leading thread, writes its uncached read to the read queue. Subsequently, the processor thread lagging behind, the trailing thread, writes its corresponding uncached read, or uncached data load request, to the queue. A compare circuit periodically scans the read queue looking for the corresponding uncached reads. If the addresses of the corresponding uncached reads match exactly, then each of the processor threads has operated without fault, and the read is allowed to execute. However, if any differences exist in the addresses of the uncached reads, the compare circuit initiates a fault recovery sequence.
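By way of illustration only, the read queue and compare circuit may be modeled in software as follows. This is a sketch under stated assumptions rather than the circuit itself: the class names, the `seq` tag used to pair the two copies of one uncached read, and the use of an exception to stand in for the fault recovery sequence are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


class TransientFault(Exception):
    """Stands in for initiating the fault recovery sequence (assumed name)."""


@dataclass(frozen=True)
class UncachedRead:
    thread: str   # "leading" or "trailing"
    seq: int      # assumed tag pairing the two copies of the same read
    address: int


class ReadQueue:
    def __init__(self) -> None:
        self._entries: Dict[int, Dict[str, int]] = {}

    def push(self, read: UncachedRead) -> None:
        """Either thread writes its uncached read into the queue."""
        if read.thread not in ("leading", "trailing"):
            raise ValueError(f"unknown thread {read.thread!r}")
        self._entries.setdefault(read.seq, {})[read.thread] = read.address

    def scan(self) -> List[int]:
        """Compare step: return addresses of matched reads, oldest first.

        Reads still waiting for the other thread stay in the queue. Any
        address mismatch raises before any read is released.
        """
        ready = [s for s in sorted(self._entries) if len(self._entries[s]) == 2]
        for s in ready:
            pair = self._entries[s]
            if pair["leading"] != pair["trailing"]:
                raise TransientFault(f"uncached read {s}: address mismatch")
        return [self._entries.pop(s)["leading"] for s in ready]
```

In this sketch, a read is released for execution only after both threads have submitted it and the two addresses agree, so a fault in either thread is caught before the read reaches the I/O device.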
The preferred embodiment further comprises a data value replication circuit that captures the result of the uncached read, the return data, and replicates that data for use by each of the threads. This ensures that each thread uses the same input value in further processing, avoiding a later misdiagnosis of a transient fault.
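By way of illustration only, data value replication may be sketched as follows. The class name `LoadValueReplicator`, the `seq` tag, and the `device_read` callable are assumptions made for the example. The sketch performs the uncached read once, captures the return data, and hands the same value to each thread.

```python
from typing import Callable, Dict, Set


class LoadValueReplicator:
    def __init__(self, device_read: Callable[[int], int]) -> None:
        self._device_read = device_read        # assumed interface to the I/O device
        self._values: Dict[int, int] = {}      # seq -> captured return data
        self._pending: Dict[int, Set[str]] = {}

    def issue(self, seq: int, address: int) -> None:
        """Perform the verified uncached read once and capture its return data."""
        self._values[seq] = self._device_read(address)
        self._pending[seq] = {"leading", "trailing"}

    def take(self, seq: int, thread: str) -> int:
        """Deliver the captured value; free the entry once both threads have it."""
        value = self._values[seq]
        self._pending[seq].discard(thread)
        if not self._pending[seq]:
            del self._values[seq]
            del self._pending[seq]
        return value
```

The design choice matters because an I/O device may return a different value on every read. If each thread read the device separately, the threads could diverge even though no fault occurred.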
Alternatively, a second embodiment of the invention comprises the read queue into which the leading thread places its uncached read. As the trailing thread reaches this point in the program execution, hardware and firmware associated with that thread compares the uncached read, without placing that uncached read in the same queue as the previous uncached read, and finds the corresponding uncached load from the leading thread. If these two uncached reads match exactly, the uncached read placed in the queue is marked as verified and the trailing thread read is effectively discarded. The verified uncached read is then sent to its appropriate location in the cache or main memory areas.
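By way of illustration only, this second embodiment may be sketched as follows. Only the leading thread places its uncached read in the queue; the trailing thread's read is compared against that entry and then discarded. The class and method names and the `seq` pairing tag are hypothetical, and raising an exception on a mismatch is an assumption, since the handling of a mismatch is not detailed for this embodiment.

```python
from typing import Dict, List


class TransientFault(Exception):
    """Assumed stand-in for whatever recovery follows a mismatch."""


class LeadingReadQueue:
    def __init__(self) -> None:
        self._addresses: Dict[int, int] = {}   # seq -> leading thread's address
        self._verified: Dict[int, bool] = {}

    def push_leading(self, seq: int, address: int) -> None:
        """The leading thread places its uncached read in the queue."""
        self._addresses[seq] = address
        self._verified[seq] = False

    def check_trailing(self, seq: int, address: int) -> None:
        """Compare the trailing read with the queued one; nothing is enqueued."""
        if self._addresses[seq] != address:
            raise TransientFault(f"uncached read {seq}: address mismatch")
        self._verified[seq] = True   # the trailing copy is now discarded

    def release_verified(self) -> List[int]:
        """Remove and return the addresses of verified reads, oldest first."""
        done = [s for s in sorted(self._addresses) if self._verified[s]]
        released = [self._addresses.pop(s) for s in done]
        for s in done:
            del self._verified[s]
        return released
```

Compared with a queue holding both threads' reads, this arrangement stores one entry per uncached read rather than two, at the cost of performing the comparison when the trailing thread arrives.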