1. Field of the Invention
The present invention generally relates to microprocessors. More particularly, the present invention relates to a pipelined, multithreaded processor that can execute a program in at least two separate, redundant threads. More particularly still, the invention relates to a method and apparatus for ensuring valid replication of external interrupts.
2. Background of the Invention
Solid state electronics, such as microprocessors, are susceptible to transient hardware faults. For example, cosmic rays can alter the voltage levels that represent data values in microprocessors, which typically include tens or hundreds of thousands of transistors. Cosmic radiation can change the state of individual transistors, causing faulty operation. Faults caused by cosmic radiation typically are temporary and the transistors eventually switch back to their normal state. The frequency of such transient faults is relatively low, typically less than one fault per year per thousand computers. Because of this relatively low failure rate, making computers fault tolerant currently is attractive more for mission-critical applications, such as online transaction processing and the space program, than for computers used by average consumers. However, future microprocessors will be more prone to transient faults due to their smaller anticipated size, reduced voltage levels, higher transistor count, and reduced noise margins. Accordingly, even low-end personal computers may benefit from being able to protect against such faults.
One way to protect solid state electronics from faults resulting from cosmic radiation is to surround the potentially affected electronics with a sufficient amount of concrete. It has been calculated that the energy flux of the cosmic rays can be reduced to acceptable levels with six feet or more of concrete surrounding the computer containing the chips to be protected. For obvious reasons, protecting electronics from faults caused by cosmic rays with six feet of concrete usually is not feasible. Further, computers usually are placed in buildings that have already been constructed without this amount of concrete. Other techniques for protecting microprocessors from faults created by cosmic radiation also have been suggested or implemented.
Rather than attempting to create an impenetrable barrier that cosmic rays cannot pierce, it is generally more economically feasible and otherwise more desirable to provide the affected electronics with a way to detect and recover from a fault caused by cosmic radiation. In this manner, a cosmic ray may still impact the device and cause a fault, but the device or system in which the device resides can detect and recover from the fault. This disclosure focuses on enabling microprocessors (referred to throughout this disclosure simply as "processors") to recover from a fault condition. One technique, such as that implemented in the Compaq Himalaya system, includes two identical "lockstepped" microprocessors. Lockstepped processors have their clock cycles synchronized and both processors are provided with identical inputs (i.e., the same instructions to execute, the same data, etc.). A checker circuit compares the processors' data output (which may also include memory addresses for store instructions). The output data from the two processors should be identical because the processors are processing the same data using the same instructions, unless of course a fault exists. If an output data mismatch occurs, the checker circuit flags an error and initiates a software or hardware recovery sequence. Thus, if one processor has been affected by a transient fault, its output likely will differ from that of the other synchronized processor. Although lockstepped processors are generally satisfactory for creating a fault tolerant environment, implementing fault tolerance with two processors takes up valuable real estate.
A pipelined, simultaneous multithreaded, out-of-order processor generally can be lockstepped. A "pipelined" processor includes a series of functional units (e.g., fetch unit, decode unit, execution units, etc.), arranged so that several units can be simultaneously processing an appropriate part of several instructions. Thus, while one instruction is being decoded, an earlier fetched instruction can be executed. A "simultaneous multithreaded" ("SMT") processor permits instructions from two or more different program threads (e.g., applications) to be processed through the processor simultaneously. An "out-of-order" processor permits instructions to be processed in an order that is different from the order in which the instructions are provided in the program (referred to as "program order"). Out-of-order processing potentially increases the throughput efficiency of the processor. Accordingly, an SMT processor can process two programs simultaneously.
An SMT processor can be modified so that the same program is simultaneously executed in two separate threads to provide fault tolerance within a single processor. Such a processor is called a simultaneously and redundantly threaded ("SRT") processor. Some of the modifications to turn an SMT processor into an SRT processor are described in Provisional Application Serial No. 60/198,530.
Executing the same program in two different threads permits the processor to detect faults such as may be caused by cosmic radiation, noted above. By comparing the output data from the two threads at appropriate times and locations within the SRT processor, it is possible to detect whether a fault has occurred. For example, data written to cache memory or registers that should be identical from corresponding instructions in the two threads can be compared. If the output data matches, there is no fault. Alternatively, if there is a mismatch in the output data, a fault has occurred in one or both of the threads.
Although an SRT processor can provide lockstepped execution of redundant threads, forcing the programs to remain lockstepped imposes significant performance penalties. The performance suffers because the two threads are always competing for the same resources, so that no intelligent resource sharing is allowed. The two threads will also suffer the same latency caused by cache misses, and will suffer the same penalty for branch misspeculations. As explained in U.S. patent application Ser. No. 09/584,034, the performance of an SRT processor can be significantly enhanced by eliminating the lockstep requirement and introducing some slack between the execution of the threads. Each of the threads then gains a statistically improved access to processor resources, and is able to benefit in the normal way from out-of-order instruction execution. In addition, the trailing thread is allowed to avoid suffering any cache miss latency if the slack is chosen properly. Further, the branch information from the leading thread is provided to the trailing thread, so that the trailing thread is able to avoid any branch misspeculation. Whenever the slack between the two threads falls below some threshold, the instruction fetch circuitry preferentially fetches more instructions for the leading thread. The net result is faster execution for both threads, and an overall average performance improvement of about 16% has been achieved.
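By way of illustration only, the slack-based fetch preference described above might be sketched as follows. The function name, the use of committed-instruction counts as the measure of slack, and the choice to favor the trailing thread once the threshold is met are assumptions made for this sketch and are not taken from the referenced application.

```python
def choose_fetch_thread(leading_count: int, trailing_count: int,
                        slack_threshold: int) -> str:
    """Pick the thread whose instructions the fetch unit should prefer.

    leading_count / trailing_count are hypothetical per-thread instruction
    counts; their difference is taken as the current slack.
    """
    slack = leading_count - trailing_count
    if slack < slack_threshold:
        # Slack has fallen below the threshold: prefer the leading thread
        # so that it pulls further ahead of the trailing thread.
        return "leading"
    # Assumption for this sketch: otherwise let the trailing thread catch up.
    return "trailing"
```

With a threshold of 4, a slack of 2 selects the leading thread, while a slack of 12 selects the trailing thread.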
FIG. 1 shows a conceptual model which can be applied to a fault-tolerant system. The system is divided into a sphere of replication 10 and the rest of the system 12. The sphere of replication 10 represents the portion of the system that provides fault protection by duplication. This would include, for example, lockstepped processors (duplicate hardware) or SRT processors (duplication of execution). In FIG. 1, the duplication is shown by redundant execution copies 18, 19. The portion 12 of the system outside the sphere of replication 10 is protected by means other than duplication. Portion 12 generally includes system memory and disk storage, and often includes cache memories. These elements are commonly protected against faults by parity checks or other error correction coding techniques.
The two portions of the system are conceptually coupled by an input replicator 14, and an output comparator 16. The input replicator 14 provides both of the redundant execution copies 18, 19 with identical values, and the output comparator 16 verifies that the output values match before it allows information to be sent to the rest of the system 12. This prevents any faults inside the sphere of replication 10 from propagating to the rest of the system, and it provides an opportunity for fault detection. Upon detecting a fault, the comparator 16 preferably initiates some kind of fault recovery procedure.
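The conceptual model of FIG. 1 can be illustrated in software, purely as a sketch. Here the two redundant execution copies are modeled as ordinary functions, and the names `run_in_sphere` and `FaultDetected` are hypothetical. Raising an exception stands in for whatever fault recovery procedure a real system would initiate.

```python
from typing import Callable, Iterable, List


class FaultDetected(Exception):
    """Raised when the redundant execution copies disagree on an output."""


def run_in_sphere(inputs: Iterable[int],
                  copy_a: Callable[[int], int],
                  copy_b: Callable[[int], int]) -> List[int]:
    """Feed identical inputs to both copies and release only matching outputs."""
    released = []
    for value in inputs:
        # Input replicator: both copies receive the identical value.
        out_a = copy_a(value)
        out_b = copy_b(value)
        # Output comparator: a mismatch must not leave the sphere.
        if out_a != out_b:
            raise FaultDetected(f"mismatch for input {value}: {out_a} != {out_b}")
        released.append(out_a)
    return released
```

When both copies compute the same function, every output is released; if one copy returns a corrupted value, the comparator raises before that value reaches the rest of the system.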
In a synchronous, lockstep system, input replicator 14 and output comparator 16 are so straightforward as to be almost overlooked. At any given clock cycle, the same input is provided to both execution copies 18, 19, and the outputs from both copies 18, 19 are compared for verification. However, the system of U.S. patent application Ser. No. 09/584,034 presents several issues that the replicator and comparator implementations must address. These include a variable slack between inputs and outputs for the execution copies, variable orders of inputs and outputs for the execution copies, and branch misspeculation by only the leading thread. Accordingly, it would be desirable to provide an input replicator implementation that addresses these issues in an efficient manner.
The problems noted above are in part solved by a processor having an instruction fetch unit that accounts for slack between threads when initiating interrupt service routines in the threads. In one embodiment, the processor comprises: instruction execution circuitry, a counter, and an instruction fetch unit. The instruction execution circuitry executes instructions in a leading thread and a redundant, trailing thread. The counter tracks the difference between the leading and trailing threads in terms of the number of instructions committed by the instruction execution circuitry. The instruction fetch unit fetches instructions for the redundant threads. When the processor receives an external interrupt signal, the instruction fetch unit stalls the leading thread until the counter indicates that the threads are synchronized, and then simultaneously initiates an interrupt service routine in each of the threads. In a second embodiment similar to the first, the instruction fetch unit does not stall the leading thread, but rather immediately initiates the interrupt service routine in the leading thread and copies the difference to an interrupt counter. The interrupt counter is decremented as instructions are committed by the trailing thread, and when the interrupt counter reaches zero, the fetch unit initiates the interrupt service routine in the trailing thread.
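The first two embodiments can be contrasted with a minimal sketch, offered only as an illustration. Each function takes hypothetical committed-instruction counts for the two threads at the moment the external interrupt arrives and returns the count at which each thread enters its interrupt service routine. Committing one trailing instruction per loop iteration is a simplifying assumption of the sketch.

```python
def stall_until_synchronized(leading_committed: int,
                             trailing_committed: int) -> tuple:
    """First embodiment: stall the leading thread until the slack is zero."""
    slack = leading_committed - trailing_committed  # the difference counter
    while slack > 0:
        # The leading thread is stalled; only the trailing thread commits.
        trailing_committed += 1
        slack -= 1
    # Threads are synchronized: both enter the service routine together.
    return leading_committed, trailing_committed


def defer_trailing_interrupt(leading_committed: int,
                             trailing_committed: int) -> tuple:
    """Second embodiment: interrupt the leading thread at once, defer the other."""
    leading_isr_point = leading_committed  # leading thread is not stalled
    interrupt_counter = leading_committed - trailing_committed  # copied difference
    while interrupt_counter > 0:
        # Decrement once for each instruction the trailing thread commits.
        trailing_committed += 1
        interrupt_counter -= 1
    # Counter reached zero: the trailing thread now enters the service routine.
    return leading_isr_point, trailing_committed
```

In both cases the two threads take the interrupt after the same number of committed program instructions, so the interrupt is replicated at the same point in each thread's instruction stream. The variants differ only in whether the leading thread waits.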
In a third embodiment, counters are provided for each thread and used to track the number of instructions committed by each thread. When an interrupt is detected, an interrupt service routine is initiated in the leading thread without delay, and the counter value for the leading thread is placed in an interrupt queue. When the counter for the trailing thread matches the value in the queue, the interrupt service routine is initiated in the trailing thread.
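The third embodiment might be modeled, again only as a sketch, with two per-thread counters and a first-in, first-out interrupt queue. The class name `InterruptReplicator`, its method names, and the `log` list used to record where each service routine begins are all hypothetical. The sketch counts only program instructions and does not model the instructions of the service routine itself.

```python
from collections import deque


class InterruptReplicator:
    """Per-thread commit counters plus a queue of pending trailing interrupts."""

    def __init__(self):
        self.leading_count = 0
        self.trailing_count = 0
        self.queue = deque()  # entries: (leading counter value, interrupt id)
        self.log = []         # entries: (thread, interrupt id, counter value)

    def commit_leading(self):
        self.leading_count += 1

    def commit_trailing(self):
        self.trailing_count += 1
        self._deliver_to_trailing()

    def external_interrupt(self, irq):
        # The leading thread takes the interrupt without delay.
        self.log.append(("leading", irq, self.leading_count))
        # Record the point at which the trailing thread must take it.
        self.queue.append((self.leading_count, irq))
        # Covers the case where the threads are already synchronized.
        self._deliver_to_trailing()

    def _deliver_to_trailing(self):
        # Deliver every queued interrupt whose recorded count has been reached.
        while self.queue and self.queue[0][0] == self.trailing_count:
            count, irq = self.queue.popleft()
            self.log.append(("trailing", irq, count))
```

For example, if an interrupt arrives when the leading thread has committed five instructions and the trailing thread two, the trailing entry appears in `log` only after the trailing thread commits its fifth instruction.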
In alternative embodiments, the counters may count numbers of fetched non-speculative instructions (or differences thereof) rather than numbers of committed instructions. The present invention further contemplates methods for implementing the above embodiments.