1. Field of the Invention
The present invention generally relates to microprocessors. More particularly, the present invention relates to a pipelined, multithreaded processor that can execute a program in at least two separate, redundant threads. More particularly still, the invention relates to a method and apparatus for ensuring valid replication of reads from a cycle counter to each redundant thread.
2. Background of the Invention
Solid state electronics, such as microprocessors, are susceptible to transient hardware faults. For example, cosmic rays or alpha particles can alter the voltage levels that represent data values in microprocessors, which typically include millions of transistors. Cosmic radiation can change the state of individual transistors, causing faulty operation. The frequency of such transient faults is relatively low, typically less than one fault per year per thousand computers. Because of this relatively low failure rate, making computers fault tolerant currently is more attractive for mission-critical applications, such as online transaction processing and the space program, than for computers used by average consumers. However, future microprocessors will be more prone to transient faults due to their smaller anticipated size, reduced voltage levels, higher transistor count, and reduced noise margins. Accordingly, even low-end personal computers may benefit from being able to protect against such faults.
One way to protect solid state electronics from faults resulting from cosmic radiation is to surround the potentially affected electronics with a sufficient amount of concrete. It has been calculated that the energy flux of the cosmic rays can be reduced to acceptable levels with six feet or more of concrete surrounding the computer containing the chips to be protected. For obvious reasons, protecting electronics from faults caused by cosmic rays with six feet of concrete usually is not feasible. Further, computers usually are placed in buildings that have already been constructed without this amount of concrete.
Rather than attempting to create a barrier that cosmic rays cannot pierce, it is generally more economically feasible and otherwise more desirable to provide the affected electronics with a way to detect and recover from a fault caused by cosmic radiation. In this manner, a cosmic ray may still impact the device and cause a fault, but the device or system in which the device resides can detect and recover from the fault. This disclosure focuses on enabling microprocessors (referred to throughout this disclosure simply as “processors”) to recover from a fault condition. One technique, such as that implemented in the Compaq Himalaya system, includes two identical “lockstepped” microprocessors. Lockstepped processors have their clock cycles synchronized, and both processors are provided with identical inputs (i.e., the same instructions to execute, the same data, etc.). A checker circuit compares the processors' data output (which may also include memory addresses for store instructions). The output data from the two processors should be identical because the processors are processing the same data using the same instructions, unless of course a fault exists. If an output data mismatch occurs, the checker circuit flags an error and initiates a software or hardware recovery sequence. Thus, if one processor has been affected by a transient fault, its output likely will differ from that of the other synchronized processor. Although lockstepped processors are generally satisfactory for creating a fault tolerant environment, implementing fault tolerance with two processors takes up valuable real estate.
A “pipelined” processor includes a series of functional units (e.g., fetch unit, decode unit, execution units, etc.), arranged so that several units can simultaneously process an appropriate part of several instructions. Thus, while one instruction is being decoded, an earlier fetched instruction can be executed. A “simultaneous multithreaded” (“SMT”) processor permits instructions from two or more different program threads (e.g., applications) to be processed through the processor simultaneously. Accordingly, an SMT processor can process two programs simultaneously. An “out-of-order” processor permits instructions to be processed in an order that is different from the order in which the instructions are provided in the program (referred to as “program order”). Out-of-order processing potentially increases the throughput efficiency of the processor.
An SMT processor can be modified so that the same program is simultaneously executed in two separate threads to provide fault tolerance within a single processor. Such a processor is called a simultaneous and redundantly threaded (“SRT”) processor. Some of the modifications to turn an SMT processor into an SRT processor are described in Provisional Application Ser. No. 60/198,530.
Executing the same program in two different threads permits the processor to detect faults such as may be caused by cosmic radiation, noted above. By comparing the output data from the two threads at appropriate times and locations within the SRT processor, it is possible to detect whether a fault has occurred. For example, data written to cache memory or registers that should be identical from corresponding instructions in the two threads can be compared. If the output data matches, there is no fault. Alternatively, if there is a mismatch in the output data, a fault has presumably occurred in one or both of the threads.
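The output comparison described above can be illustrated with a minimal software sketch. It is not the comparison hardware of any particular SRT processor; the representation of a store as an (address, value) pair and the name first_mismatch are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# Assumption for illustration: each store is an (address, value) pair.
Store = Tuple[int, int]


def first_mismatch(leading: List[Store], trailing: List[Store]) -> Optional[int]:
    """Return the index of the first differing store, or None if all match."""
    for i, (a, b) in enumerate(zip(leading, trailing)):
        if a != b:
            return i
    # Streams of different length also indicate that the threads diverged.
    if len(leading) != len(trailing):
        return min(len(leading), len(trailing))
    return None


clean = [(0x100, 7), (0x104, 9)]
faulty = [(0x100, 7), (0x104, 13)]  # simulated bit flip in the stored value

print(first_mismatch(clean, clean))   # None: outputs agree, no fault detected
print(first_mismatch(clean, faulty))  # 1: second store differs, fault presumed
```

A mismatch identifies only that the threads disagree, not which thread was affected, which is consistent with the statement that a fault has presumably occurred in one or both of the threads.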
Executing the same program in two separate threads advantageously affords the SRT processor some degree of fault tolerance, but also may cause several performance problems. For instance, any latency caused by a cache miss is exacerbated. Cache misses occur when an instruction requests data from memory that is not also available in cache memory. The processor first checks whether the requested data already resides in the faster access cache memory, which generally is onboard the processor die. If the requested data is not present in cache (a condition referred to as a cache “miss”), then the processor is forced to retrieve the data from main system memory, which takes more time than retrieving the data from the faster onboard cache and thereby causes latency. Because the two threads are executing the same instructions, any instruction in one thread that results in a cache miss will also experience the same cache miss when that same instruction is executed in the other thread. That is, the cache latency will be present in both threads.
A second performance problem concerns branch misspeculation. A branch instruction requires program execution either to continue with the instruction immediately following the branch instruction if a certain condition is met, or branch to a different instruction if the particular condition is not met. Accordingly, the outcome of a branch instruction is not known until the instruction is executed. In a pipelined architecture, a branch instruction (or any instruction for that matter) may not be executed for at least several, and perhaps many, clock cycles after the branch instruction is fetched by the fetch unit in the processor. In order to keep the pipeline full (which is desirable for efficient operation), a pipelined processor includes branch prediction logic which predicts the outcome of a branch instruction before it is actually executed (also referred to as “speculating”). Branch prediction logic generally bases its speculation on short or long term history. As such, using branch prediction logic, a processor's fetch unit can speculate the outcome of a branch instruction before it is actually executed. The speculation, however, may or may not turn out to be accurate. That is, the branch predictor logic may guess wrong regarding the direction of program execution following a branch instruction. If the speculation proves to have been accurate, which is determined when the branch instruction is executed by the processor, then the next instructions to be executed have already been fetched and are working their way through the pipeline.
If, however, the branch speculation turns out to have been the wrong prediction (referred to as “misspeculation”), many or all of the instructions filling the pipeline behind the branch instruction may have to be thrown out (i.e., not executed) because they are not the correct instructions to be executed after the branch instruction. The result is a substantial performance hit as the fetch unit must fetch the correct instructions to be processed through the pipeline. Suitable branch prediction methods, however, result in correct speculations more often than misspeculations, and the overall performance of the processor is better with a suitable branch predictor (even in the face of some misspeculations) than if no speculation were available at all.
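As an illustration of history-based speculation, the following sketch models a two-bit saturating counter, a widely known prediction scheme. It is offered only as an example of the general idea and is not stated by this disclosure to be the predictor used in any particular processor; the class and function names are hypothetical.

```python
from typing import Iterable


class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self) -> None:
        self.state = 2  # assumption: start in the "weakly taken" state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes: Iterable[bool]) -> int:
    """Count how often the prediction disagrees with the actual branch outcome."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # 1: only the loop exit is mispredicted
```

In this toy model the regular loop branch is mispredicted once in ten executions, which reflects the point that a suitable predictor speculates correctly more often than not.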
In an SRT processor that executes the same program in two different threads for fault tolerance, any branch misspeculation is exacerbated because both threads will experience the same misspeculation. Because the branch misspeculation occurs in both threads, the processor's internal resources usable to each thread are wasted while the wrong instructions are replaced with the correct instructions.
In an SRT processor, threads may be separated by a predetermined amount of slack to improve performance. In this scenario, one thread is processed ahead of the other thread thereby creating a “slack” of instructions between the two threads so that the instructions in one thread are processed through the processor's pipeline ahead of the corresponding instructions from the other thread. The thread whose instructions are processed earlier is called the “leading” thread, while the other thread is the “trailing” thread. By setting the amount of slack (in terms of numbers of instructions) appropriately, all or at least some of the cache misses or branch misspeculations encountered by the leading thread can be resolved before the corresponding instructions from the trailing thread are fetched and processed through the pipeline.
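The benefit of slack can be sketched with a deliberately simplified timing model. The model assumes one instruction per cycle, so that a slack of N instructions corresponds to N cycles, and assumes that a miss begun by the leading thread continues to be serviced while the trailing thread catches up. The function name and the latency figure are hypothetical.

```python
def trailing_stall_cycles(slack: int, miss_latency: int) -> int:
    """Cycles the trailing thread still waits on a miss started by the leading thread.

    Assumes one instruction per cycle, so the trailing thread reaches the
    same instruction `slack` cycles after the leading thread did.
    """
    if slack < 0 or miss_latency < 0:
        raise ValueError("slack and miss_latency must be non-negative")
    return max(0, miss_latency - slack)


# Hypothetical 200-cycle miss latency, evaluated for three slack settings.
for slack in (0, 64, 256):
    print(slack, trailing_stall_cycles(slack, miss_latency=200))
# prints:
# 0 200
# 64 136
# 256 0
```

Under these assumptions, a slack at least as large as the miss latency hides the miss from the trailing thread entirely, while a smaller slack hides only part of it.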
In an SRT processor, the processor verifies that inputs to the multiple threads are identical to guarantee that both execution copies or threads follow precisely the same path. Thus, corresponding operations that input data from other locations within the system (e.g., memory, cycle counter) must return the same data values to both redundant threads. Otherwise, the threads may follow divergent execution paths, leading to different outputs that will be detected and handled as if a hardware fault occurred.
One potential problem in running two separate, but redundant threads in a computer processor arises in reading the current value in the system cycle counter. A cycle counter is a running counter that advances once for each tick of the processor clock. Thus, for a 1 GHz processor, the counter will advance once every nanosecond. A conventional cycle counter may be a 64-bit counter that counts up from zero to the maximum value and wraps around to zero to continue counting.
A program that is running on the processor may periodically request the current value of the cycle counter using a read or fetch command. For example, Compaq Alpha servers execute an “rpcc” command that is included in the instruction set for Alpha processors. By reading the cycle counter at the start and finish of an instruction or set of instructions, the processor may calculate how many clock cycles (and therefore, how much time) elapsed during execution of the instructions. Thus, the “read cycle counter” command provides a means of measuring system performance.
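The elapsed-cycle calculation can be sketched as follows, assuming the 64-bit wrapping counter described above and assuming the counter wraps at most once between the two reads. The function names are hypothetical, and the sketch performs only the arithmetic; it does not read a real hardware counter.

```python
COUNTER_BITS = 64  # counter width taken from the description above
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def elapsed_cycles(start: int, end: int) -> int:
    """Cycles between two counter reads, allowing for a single wrap to zero."""
    return (end - start) & COUNTER_MASK


def cycles_to_seconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to seconds for a given clock frequency."""
    return cycles / clock_hz


print(elapsed_cycles(100, 350))                # 250
print(elapsed_cycles(COUNTER_MASK - 4, 5))     # 10, counter wrapped once
print(cycles_to_seconds(1_000_000_000, 1e9))   # 1.0 second at 1 GHz
```

Masking the difference to the counter width gives the correct result whether or not the counter wrapped around to zero between the two reads.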
As discussed above, corresponding instructions in redundant threads are not executed at precisely the same time. Thus, it should be expected that corresponding read cycle count commands from the different threads will always return different values because some amount of time will elapse between the cycle count retrievals. While this cycle count variation between threads may be expected, the different values may result in a fault condition because the inputs to the two threads are different. It is desirable, therefore, to develop a method of replicating the cycle count values from the cycle counter for each redundant thread in the pipeline. By replicating the cycle counter value, erroneous transient fault conditions or faulty SRT operation resulting from the trailing “read cycle count” instructions are avoided.
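The problem and the general idea of replication can be illustrated with a small sketch. The first-in, first-out queue used here is only one conceivable way to hand the leading thread's value to the trailing thread; it is an assumption made for illustration and is not stated by this background section to be the mechanism of the invention. All names are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class ReplicatedCounterReads:
    """Hypothetical FIFO that replays leading-thread counter reads to the trailing thread."""

    def __init__(self) -> None:
        self._pending: Deque[int] = deque()

    def leading_read(self, current_count: int) -> int:
        # The leading thread reads the real counter; the value is saved.
        self._pending.append(current_count)
        return current_count

    def trailing_read(self) -> int:
        # The trailing thread receives the saved value, not the live counter.
        if not self._pending:
            raise RuntimeError("trailing read issued before the leading read")
        return self._pending.popleft()


def demo(slack_cycles: int = 40) -> Tuple[int, int, int]:
    """Return (leading value, naive trailing value, replicated trailing value)."""
    reads = ReplicatedCounterReads()
    lead_time = 1000                       # counter value at the leading read
    trail_time = lead_time + slack_cycles  # counter value at the trailing read
    leading_value = reads.leading_read(lead_time)
    naive_trailing_value = trail_time      # a direct read would differ
    replicated_trailing_value = reads.trailing_read()
    return leading_value, naive_trailing_value, replicated_trailing_value


print(demo())  # (1000, 1040, 1000)
```

In the sketch, a direct read by the trailing thread returns 1040 and would be treated as an input mismatch, whereas the replicated read returns the same 1000 that the leading thread observed.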