Not applicable.
1. Field of the Invention
The present invention generally relates to microprocessors. More particularly, the present invention relates to a pipelined, multithreaded processor that can process the same instruction set in at least two separate threads. More particularly still, the invention relates to imposing "slack" between corresponding instructions in the threads of a multithreaded processor to improve the processor's performance.
2. Background of the Invention
Solid state electronics, such as microprocessors, are susceptible to transient hardware faults. For example, cosmic rays can alter the voltage levels that represent data values in microprocessors, which typically include tens or hundreds of thousands of transistors. Cosmic radiation can change the state of individual transistors, causing faulty operation. Faults caused by cosmic radiation typically are temporary, and the transistors eventually switch back to their normal state. The frequency of such transient faults is relatively low, typically less than one fault per year per thousand computers. Because of this relatively low failure rate, making computers fault tolerant currently is attractive more for mission-critical applications, such as online transaction processing and the space program, than for computers used by average consumers. However, future microprocessors will be more prone to transient faults due to their smaller anticipated size, reduced voltage levels, higher transistor count, and reduced noise margins. Accordingly, even low-end personal computers may benefit from being able to protect against such faults.
One way to protect solid state electronics from faults resulting from cosmic radiation is to surround the potentially affected electronics with a sufficient amount of concrete. It has been calculated that the energy flux of the cosmic rays can be reduced to acceptable levels with six feet or more of concrete surrounding the computer containing the chips to be protected. For obvious reasons, protecting electronics from faults caused by cosmic rays with six feet of concrete usually is not feasible. Further, computers usually are placed in buildings that have already been constructed without this amount of concrete. Other techniques for protecting microprocessors from faults created by cosmic radiation also have been suggested or implemented.
Rather than attempting to create a barrier that cosmic rays cannot penetrate, it is generally more economically feasible and otherwise more desirable to provide the affected electronics with a way to detect and recover from a fault caused by cosmic radiation. In this manner, a cosmic ray may still impact the device and cause a fault, but the device or system in which the device resides can detect and recover from the fault. This disclosure focuses on enabling microprocessors (referred to throughout this disclosure simply as "processors") to recover from a fault condition. One technique, such as that implemented in the Compaq Himalaya system, includes two identical "lockstepped" microprocessors. Lockstepped processors have their clock cycles synchronized, and both processors are provided with identical inputs (i.e., the same instructions to execute, the same data, etc.). A checker circuit compares the processors' data output (which may also include memory addresses for store instructions). The output data from the two processors should be identical because the processors are processing the same data using the same instructions, unless of course a fault exists. If an output data mismatch occurs, the checker circuit flags an error and initiates a software or hardware recovery sequence. Thus, if one processor has been affected by a cosmically-created fault, its output likely will differ from that of the other synchronized processor. Although lockstepped processors are generally satisfactory for creating a fault tolerant environment, implementing fault tolerance with two processors takes up valuable real estate.
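The checker behavior described above can be illustrated with a minimal Python sketch. It is not the Himalaya checker circuit; the function names `run_program` and `check_lockstep`, the base address `0x1000`, and the toy "running sum" program are all assumptions made for illustration. Two runs of the same program stand in for the two lockstepped processors, and a single flipped bit stands in for a transient fault.

```python
from typing import List, Optional, Tuple

Output = Tuple[int, int]  # (store address, store value)

def run_program(inputs: List[int], flip_at: Optional[int] = None) -> List[Output]:
    """Hypothetical 'processor': stores the running sum of its inputs.

    If flip_at is given, a single-bit transient fault corrupts the value
    stored at that step only; later steps are unaffected.
    """
    outputs = []
    total = 0
    for step, value in enumerate(inputs):
        total += value
        # Flip the low bit of the stored value once to model a transient fault.
        stored = total ^ 1 if step == flip_at else total
        outputs.append((0x1000 + 4 * step, stored))  # address is an assumption
    return outputs

def check_lockstep(a: List[Output], b: List[Output]) -> Optional[int]:
    """Return the index of the first mismatching output, or None if all match."""
    for i, (out_a, out_b) in enumerate(zip(a, b)):
        if out_a != out_b:
            return i
    return None

inputs = [3, 5, 7]
clean = check_lockstep(run_program(inputs), run_program(inputs))
faulty = check_lockstep(run_program(inputs), run_program(inputs, flip_at=1))
```

Here `clean` is `None` because both runs agree, while `faulty` is `1`, the step at which the injected fault makes the outputs diverge. A real checker would trigger a recovery sequence at that point rather than return an index.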
A pipelined, simultaneous multithreaded, out-of-order processor generally can be lockstepped. A "pipelined" processor includes a series of functional units (e.g., fetch unit, decode unit, execution units, etc.), arranged so that several units can be simultaneously processing an appropriate part of several instructions. Thus, while one instruction is being decoded, an earlier fetched instruction can be executed. A "simultaneous multithreaded" ("SMT") processor permits instructions from two or more different program threads (e.g., applications) to be processed through the processor simultaneously. An "out-of-order" processor permits instructions to be processed in an order that is different from the order in which the instructions are provided in the program (referred to as "program order"). Out-of-order processing potentially increases the throughput efficiency of the processor. Accordingly, an SMT processor can process two programs simultaneously. It is generally possible to lockstep an SMT processor cycle by cycle.
An SMT processor can be modified so that the same program is simultaneously executed in two separate threads to provide fault tolerance within a single processor. Such a processor is called a simultaneously and redundantly threaded ("SRT") processor. Some of the modifications to turn an SMT processor into an SRT processor are described in U.S. Provisional Application Serial No. 60/198,530. Executing the same program in two different threads permits the processor to detect faults such as may be caused by cosmic radiation, noted above. By comparing the output data from the two threads at appropriate times and locations within the SRT processor, it is possible to detect whether a fault has occurred. For example, data written to cache memory or registers that should be identical from corresponding instructions in the two threads can be compared. If the output data matches, there is no fault. Alternatively, if there is a mismatch in the output data, a fault has occurred in one or both of the threads.
Executing the same program in two separate threads advantageously affords the SRT processor some degree of fault tolerance, but also may cause several performance problems. For instance, any latency caused by a cache miss is exacerbated. Cache misses occur when an instruction requests data from memory that is not also available in cache memory. The processor first checks whether the requested data already resides in the faster access cache memory, which generally is onboard the processor die. If the requested data is not present in cache (a condition referred to as a cache "miss"), then the processor is forced to retrieve the data from main system memory, which takes more time, and thereby causes more latency, than if the data could have been retrieved from the faster onboard cache. Because the two threads are executing the same instructions, any instruction in one thread that results in a cache miss will also experience the same cache miss when that same instruction is executed in the other thread. That is, the cache latency will be present in both threads.
A second performance problem concerns branch misspeculation. A branch instruction requires program execution either to continue with the instruction immediately following the branch instruction if a certain condition is met, or branch to a different instruction if the particular condition is not met. Accordingly, the outcome of a branch instruction is not known until the instruction is executed. In a pipelined architecture, a branch instruction (or any instruction for that matter) may not be executed for at least several, and perhaps many, clock cycles after the branch instruction is fetched by the fetch unit in the processor. In order to keep the pipeline full (which is desirable for efficient operation), a pipelined processor includes branch prediction logic which predicts the outcome of a branch instruction before it is actually executed (also referred to as "speculating"). Branch prediction logic generally bases its speculation on short or long term history. As such, using branch prediction logic, a processor's fetch unit can speculate the outcome of a branch instruction before it is actually executed. The speculation, however, may or may not turn out to be accurate. That is, the branch predictor logic may guess wrong regarding the direction of program execution following a branch instruction. If the speculation proves to have been accurate, which is determined when the branch instruction is executed by the processor, then the next instructions to be executed have already been fetched and are working their way through the pipeline.
If, however, the branch speculation turns out to have been the wrong prediction (referred to as "misspeculation"), many or all of the instructions filling the pipeline behind the branch instruction may have to be thrown out (i.e., not executed) because they are not the correct instructions to be executed after the branch instruction. The result is a substantial performance hit, as the fetch unit must fetch the correct instructions to be processed through the pipeline. Suitable branch prediction methods, however, result in correct speculations more often than misspeculations, and the overall performance of the processor is better with a suitable branch predictor (even in the face of some misspeculations) than if no speculation were available at all.
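As an illustration of history-based speculation, the sketch below uses a single 2-bit saturating counter, a common textbook scheme that is assumed here and is not taken from this disclosure. The function name `count_mispredictions` and the initial counter state are likewise illustrative choices. The code counts how often the prediction disagrees with the actual branch outcome.

```python
from typing import Iterable

def count_mispredictions(outcomes: Iterable[bool], state: int = 2) -> int:
    """Predict each branch with one 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    The starting state (2, weakly taken) is an arbitrary assumption.
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions

# A loop branch that is taken 9 times and then falls through, run 3 times.
loop_pattern = ([True] * 9 + [False]) * 3
loop_misses = count_mispredictions(loop_pattern)
```

For this loop pattern the counter mispredicts only the three loop exits out of thirty branches, which is consistent with the observation that a suitable predictor is right more often than it is wrong. Each remaining misprediction still costs a pipeline flush.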
In an SRT processor that executes the same program in two different threads for fault tolerance, any branch misspeculation is exacerbated because both threads will experience the same misspeculation. Because the branch misspeculation occurs in both threads, the processor's internal resources usable by each thread are wasted while the wrong instructions are replaced with the correct instructions.
Of course, it is always desirable to improve the efficiency of a processor. Accordingly, any increase in efficiency, and thus speed, of an SRT processor is highly desirable. Similarly, improvements in the efficiency of a simultaneous multithreaded processor capable of executing the same instruction set as two separate threads for fault tolerance also are desirable.
The problems noted above are solved in large part by a simultaneous and redundantly threaded processor that can simultaneously execute the same program in two separate threads to provide fault tolerance. By simultaneously executing the same program twice, the system can be made fault tolerant by checking the output data pertaining to corresponding instructions in the threads to ensure that the data matches. A data mismatch indicates a fault in the processor affecting one or both of the threads. The preferred embodiment of the invention provides an increase in performance to such a fault tolerant, simultaneous and redundantly threaded processor.
In accordance with the preferred embodiment of the invention, one thread is processed ahead of the other thread, thereby creating a "slack" of instructions between the two threads so that the instructions in one thread are processed through the processor's pipeline ahead of the corresponding instructions from the other thread. The thread whose instructions are processed earlier is called the "leading" thread, while the other thread is the "trailing" thread. By setting the amount of slack (in terms of numbers of instructions) appropriately, all or at least some of the cache misses or branch misspeculations encountered by the leading thread can be resolved before the corresponding instructions from the trailing thread are fetched and processed through the pipeline.
A cache miss in the leading thread resulting, for example, from a store instruction will cause the requested data to be stored in the cache. Then, when the same store instruction in the trailing thread is processed, the requested data will already reside in cache and no cache miss in the trailing thread will occur, thereby reducing latency. Similarly, any branch misspeculation in the leading thread will not occur in the trailing thread because the branch instruction will have been resolved in the leading thread by the time that same instruction is fetched in the trailing thread.
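The effect of slack on cache latency can be sketched with a toy timing model. Everything in it is an assumption made for illustration: one access is issued per cycle, a stall does not delay later accesses, the cache never evicts, and the name `stall_cycles` is hypothetical. The trailing thread issues each access `slack` cycles after the leading thread and waits only for whatever part of the fill is still outstanding.

```python
from typing import Dict, List, Tuple

def stall_cycles(addresses: List[int], miss_latency: int, slack: int) -> Tuple[int, int]:
    """Return (leading_stall, trailing_stall) in cycles for one access stream.

    Simplifying assumptions: the leading thread issues access i at cycle i,
    the trailing thread issues it at cycle i + slack, stalls do not shift
    later accesses, and lines are never evicted.
    """
    ready_at: Dict[int, int] = {}  # address -> cycle when its line is in the cache
    leading = trailing = 0
    for cycle, addr in enumerate(addresses):
        if addr not in ready_at:
            # First touch by the leading thread: a miss that starts the fill.
            ready_at[addr] = cycle + miss_latency
        leading += max(ready_at[addr] - cycle, 0)
        trailing += max(ready_at[addr] - (cycle + slack), 0)
    return leading, trailing

stream = [0x10, 0x20, 0x10]
no_slack = stall_cycles(stream, miss_latency=100, slack=0)
some_slack = stall_cycles(stream, miss_latency=100, slack=50)
ample_slack = stall_cycles(stream, miss_latency=100, slack=256)
```

Under these assumptions, with no slack the trailing thread stalls exactly as long as the leading thread; with a slack of 50 it waits roughly half as long; and once the slack exceeds the miss latency it does not stall at all, because every line the leading thread requested has already arrived.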
The amount of slack preferably is programmable. Programming more slack into the system provides the leading thread a chance to bring data into the cache and resolve branch misspeculations before the corresponding instructions in the trailing thread are processed through the processor's pipeline. However, excessively long slacks can reduce system performance. A desirable amount of slack will vary from system to system and application to application. A slack of 256 instructions, for example, has been found to significantly improve system performance.
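One way a programmable slack might be enforced at the fetch stage is sketched below. This is a hypothetical policy, not the mechanism claimed in this disclosure: the names `pick_thread_to_fetch` and `simulate_fetch` are invented, and the model assumes one instruction is fetched per cycle from a single chosen thread. The leading thread is fetched until it is `target_slack` instructions ahead, after which the trailing thread is fetched.

```python
from typing import List, Tuple

def pick_thread_to_fetch(leading_fetched: int, trailing_fetched: int,
                         target_slack: int) -> str:
    """Choose "leading" until it is target_slack instructions ahead."""
    if leading_fetched - trailing_fetched < target_slack:
        return "leading"
    return "trailing"

def simulate_fetch(total_instructions: int, target_slack: int) -> Tuple[str, int]:
    """Fetch one instruction per cycle until both threads finish.

    Returns the fetch order ("L"/"T" per cycle) and the largest slack seen.
    """
    if target_slack < 1:
        raise ValueError("target_slack must be at least 1 in this sketch")
    leading = trailing = 0
    max_slack = 0
    order: List[str] = []
    while trailing < total_instructions:
        leading_done = leading == total_instructions
        if not leading_done and pick_thread_to_fetch(
                leading, trailing, target_slack) == "leading":
            leading += 1
            order.append("L")
        else:
            # Either the slack target is met or the leading thread has finished.
            trailing += 1
            order.append("T")
        max_slack = max(max_slack, leading - trailing)
    return "".join(order), max_slack

order, max_slack = simulate_fetch(total_instructions=6, target_slack=2)
```

With six instructions and a target slack of two, the fetch order is `LLTLTLTLTLTT` and the slack never exceeds two. A target such as 256 would be set the same way; the value that works best depends on the system and the application, as noted above.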