1. Field of the Invention
The present invention generally relates to microprocessors. More particularly, the present invention relates to a pipelined, multithreaded processor that can execute a program in at least two separate, redundant threads. More particularly still, the invention relates to a method and apparatus for ensuring valid replication of loads from a data cache when cache lines are invalidated and load instructions are performed out of order.
2. Background of the Invention
Solid state electronics, such as microprocessors, are susceptible to transient hardware faults. For example, cosmic rays can alter the voltage levels that represent data values in microprocessors, which typically include tens or hundreds of thousands of transistors. Cosmic radiation can change the state of individual transistors, causing faulty operation. Faults caused by cosmic radiation typically are temporary, and the transistors eventually switch back to their normal state. The frequency of such transient faults is relatively low, typically less than one fault per year per thousand computers. Because of this relatively low failure rate, making computers fault tolerant currently is attractive more for mission-critical applications, such as online transaction processing and the space program, than for computers used by average consumers. However, future microprocessors will be more prone to transient faults due to their smaller anticipated size, reduced voltage levels, higher transistor count, and reduced noise margins. Accordingly, even low-end personal computers may benefit from being able to protect against such faults.
One way to protect solid state electronics from faults resulting from cosmic radiation is to surround the potentially affected electronics with a sufficient amount of concrete. It has been calculated that the energy flux of the cosmic rays can be reduced to acceptable levels with six feet or more of concrete surrounding the computer containing the chips to be protected. For obvious reasons, protecting electronics from faults caused by cosmic rays with six feet of concrete usually is not feasible. Further, computers usually are placed in buildings that have already been constructed without this amount of concrete. Other techniques for protecting microprocessors from faults created by cosmic radiation also have been suggested or implemented.
Rather than attempting to create an impenetrable barrier that cosmic rays cannot pierce, it is generally more economically feasible and otherwise more desirable to provide the affected electronics with a way to detect and recover from a fault caused by cosmic radiation. In this manner, a cosmic ray may still impact the device and cause a fault, but the device or system in which the device resides can detect and recover from the fault. This disclosure focuses on enabling microprocessors (referred to throughout this disclosure simply as "processors") to recover from a fault condition. One technique, such as that implemented in the Compaq Himalaya system, includes two identical "lockstepped" microprocessors. Lockstepped processors have their clock cycles synchronized, and both processors are provided with identical inputs (i.e., the same instructions to execute, the same data, etc.). A checker circuit compares the processors' data output (which may also include memory addresses for store instructions). The output data from the two processors should be identical because the processors are processing the same data using the same instructions, unless of course a fault exists. If an output data mismatch occurs, the checker circuit flags an error and initiates a software or hardware recovery sequence. Thus, if one processor has been affected by a transient fault, its output likely will differ from that of the other synchronized processor. Although lockstepped processors are generally satisfactory for creating a fault tolerant environment, implementing fault tolerance with two processors takes up valuable real estate.
A pipelined, simultaneous multithreaded, out-of-order processor generally can be lockstepped. A "pipelined" processor includes a series of functional units (e.g., fetch unit, decode unit, execution units, etc.), arranged so that several units can be simultaneously processing an appropriate part of several instructions. Thus, while one instruction is being decoded, an earlier fetched instruction can be executed. A "simultaneous multithreaded" ("SMT") processor permits instructions from two or more different program threads (e.g., applications) to be processed through the processor simultaneously. An "out-of-order" processor permits instructions to be processed in an order that is different from the order in which the instructions are provided in the program (referred to as "program order"). Out-of-order processing potentially increases the throughput efficiency of the processor. Accordingly, an SMT processor can process two programs simultaneously.
An SMT processor can be modified so that the same program is simultaneously executed in two separate threads to provide fault tolerance within a single processor. Such a processor is called a simultaneously and redundantly threaded ("SRT") processor. Some of the modifications to turn an SMT processor into an SRT processor are described in Provisional Application Ser. No. 60/198,530.
Executing the same program in two different threads permits the processor to detect faults such as may be caused by cosmic radiation, noted above. By comparing the output data from the two threads at appropriate times and locations within the SRT processor, it is possible to detect whether a fault has occurred. For example, data written to cache memory or registers that should be identical from corresponding instructions in the two threads can be compared. If the output data matches, there is no fault. Alternatively, if there is a mismatch in the output data, a fault has occurred in one or both of the threads.
Although an SRT processor can provide lockstepped execution of redundant threads, forcing the programs to remain lockstepped imposes significant performance penalties. The performance suffers because the two threads are always competing for the same resources, so that no intelligent resource sharing is possible. The two threads will also suffer the same latency caused by cache misses, and will suffer the same penalty for branch misspeculations. As explained in U.S. patent application Ser. No. 09/584,034, the performance of an SRT processor can be significantly enhanced by eliminating the lockstep requirement and introducing some slack between the execution of the threads. Each of the threads then gains a statistically improved access to processor resources, and is able to benefit in the normal way from out-of-order instruction execution. In addition, the trailing thread is allowed to avoid suffering any cache miss latency if the slack is chosen properly. Further, the branch information from the leading thread is provided to the trailing thread, so that the trailing thread is able to avoid any branch misspeculation. Whenever the slack between the two threads falls below some threshold, the instruction fetch circuitry preferentially fetches more instructions for the leading thread. The net result is faster execution for both threads, and an overall average performance improvement of about 16% has been achieved.
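By way of illustration only, the slack-based fetch preference described above might be modeled in software as follows. This is a simplified sketch and not the fetch circuitry of the referenced application: the names `choose_fetch_thread` and `simulate`, the one-instruction-per-cycle fetch model, and the choice to fetch for the trailing thread whenever the slack is at or above the threshold are all assumptions made for the example.

```python
def choose_fetch_thread(leading_fetched, trailing_fetched, slack_threshold):
    """Pick the thread to fetch for in this cycle (hypothetical policy).

    Slack is modeled as the number of instructions by which the leading
    thread is ahead of the trailing thread.
    """
    slack = leading_fetched - trailing_fetched
    if slack < slack_threshold:
        # Slack has fallen below the threshold: prefer the leading thread.
        return "leading"
    # Assumption: otherwise let the trailing thread catch up.
    return "trailing"


def simulate(cycles, slack_threshold):
    """Fetch one instruction per cycle and record the slack after each cycle."""
    leading = trailing = 0
    history = []
    for _ in range(cycles):
        if choose_fetch_thread(leading, trailing, slack_threshold) == "leading":
            leading += 1
        else:
            trailing += 1
        history.append(leading - trailing)
    return history


if __name__ == "__main__":
    print(simulate(6, 3))
```

In this toy model the slack climbs to the threshold and then hovers just below and at it, which is the qualitative behavior the fetch preference is meant to produce.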
FIG. 1 shows a conceptual model which can be applied to a fault-tolerant system. The system is divided into a sphere of replication 10 and the rest of the system 12. The sphere of replication 10 represents the portion of the system that provides fault protection by duplication. This would include, for example, lockstepped processors (duplicate hardware) or SRT processors (duplication of execution). In FIG. 1, the duplication is shown by redundant execution copies 18, 19. The portion 12 of the system outside the sphere of replication 10 is protected by means other than duplication. This generally includes system memory and disk storage, and often includes cache memories. These portions are commonly protected against faults by parity checks or error correction coding.
The two portions of the system are conceptually coupled by an input replicator 14, and an output comparator 16. The input replicator 14 provides both of the redundant execution copies 18, 19 with identical values, and the output comparator 16 verifies that the output values match before it allows information to be sent to the rest of the system 12. This prevents any faults inside the sphere of replication 10 from propagating to the rest of the system, and it provides an opportunity for fault detection. Upon detecting a fault, the comparator 16 preferably initiates some kind of fault recovery procedure.
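The conceptual model of FIG. 1 can be illustrated, under simplifying assumptions, by a short software analogy. The function names `input_replicator`, `output_comparator`, and `run_redundantly`, the `FaultDetected` exception, and the optional fault-injection hook are hypothetical and serve only to make the roles of elements 14, 16, 18, and 19 concrete. An actual implementation would be in hardware and would not execute the copies one after the other.

```python
class FaultDetected(Exception):
    """Raised when the redundant copies disagree (stands in for fault recovery)."""


def input_replicator(value):
    # Element 14: give both execution copies the identical input value.
    return value, value


def output_comparator(out_a, out_b):
    # Element 16: release the output to the rest of the system only if
    # both copies produced the same value.
    if out_a != out_b:
        raise FaultDetected(f"output mismatch: {out_a!r} != {out_b!r}")
    return out_a


def run_redundantly(compute, value, inject_fault=None):
    """Run `compute` as two redundant copies inside the sphere of replication.

    `inject_fault` is a hypothetical hook that corrupts the second copy's
    result, used here only to demonstrate detection.
    """
    in_a, in_b = input_replicator(value)
    out_a = compute(in_a)  # execution copy 18
    out_b = compute(in_b)  # execution copy 19
    if inject_fault is not None:
        out_b = inject_fault(out_b)
    return output_comparator(out_a, out_b)


if __name__ == "__main__":
    print(run_redundantly(lambda x: x * 2, 21))
    try:
        run_redundantly(lambda x: x * 2, 21, inject_fault=lambda v: v ^ 1)
    except FaultDetected as err:
        print("fault detected:", err)
```

The comparator in this sketch can only detect that the copies disagree. It cannot tell which copy is faulty, which is why the description above pairs detection with a separate recovery procedure.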
In a synchronous, lockstep system, input replicator 14 and output comparator 16 are so straightforward conceptually as to be almost overlooked. At any given clock cycle, the same input is provided to both execution copies 18, 19, and the outputs from both copies 18, 19 are compared for verification. However, the system of U.S. patent application Ser. No. 09/584,034 presents several issues that the replicator and comparator implementations must address. These include a variable slack between inputs and outputs for the execution copies, variable orders of inputs and outputs for the execution copies, and branch misspeculation by only the leading thread. Accordingly, it would be desirable to provide an input replicator implementation that addresses these issues in an efficient manner.
The problems noted above are in large part solved by a processor having an Active Load Address Buffer ("ALAB") that ensures efficient replication of data values retrieved from the data cache. In one embodiment, the processor comprises a data cache, instruction execution circuitry, and an ALAB. The data cache provides temporary storage for data values recently accessed by the instruction execution circuitry. The instruction execution circuitry executes instructions in two or more redundant threads. The threads include at least one load instruction that causes the instruction execution circuitry to retrieve data from the data cache. The ALAB includes entries that are associated with data values that a leading thread has retrieved. The entries include a counter field that is incremented when the instruction execution circuitry retrieves the associated data value for the leading thread, and that is decremented when the associated data value is retrieved for the trailing thread. The entries preferably also include an invalidation field which may be set to prevent further incrementing of the counter field. This field may be used to stall the leading thread until the trailing thread has retrieved the data value the appropriate number of times, thereby returning the counter field to a zero value. Importantly, data blocks in the data cache are "frozen" whenever they have an associated entry in the ALAB with a nonzero counter value. The data blocks are replaced only if no associated entry exists in the ALAB or the associated entry has a zero-valued counter field.
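The counter and invalidation behavior summarized above may be easier to follow in a small software model. The following is a minimal sketch under stated assumptions, not the disclosed hardware: the class and method names, the use of a dictionary keyed by block address, and the decision to discard an entry once its counter returns to zero are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlabEntry:
    counter: int = 0           # leading-thread loads not yet matched by the trailing thread
    invalidated: bool = False  # when set, blocks further incrementing of the counter


class ActiveLoadAddressBuffer:
    """Toy model of an ALAB, keyed by data-cache block address."""

    def __init__(self):
        self.entries = {}

    def leading_load(self, block):
        """Return True if the leading thread may load; False means it must stall."""
        entry = self.entries.setdefault(block, AlabEntry())
        if entry.invalidated:
            return False
        entry.counter += 1
        return True

    def trailing_load(self, block):
        """Record the trailing thread's matching load of the same block."""
        entry = self.entries.get(block)
        if entry is None or entry.counter == 0:
            raise ValueError("trailing load without a matching leading load")
        entry.counter -= 1
        if entry.counter == 0:
            # Assumption: an entry is discarded once its counter returns to zero.
            del self.entries[block]

    def request_invalidate(self, block):
        """Return True if the block may be invalidated now; otherwise mark the entry."""
        entry = self.entries.get(block)
        if entry is None or entry.counter == 0:
            self.entries.pop(block, None)
            return True
        entry.invalidated = True  # the block stays frozen until the counter drains
        return False

    def can_replace(self, block):
        """A block is replaceable only with no entry or a zero-valued counter."""
        entry = self.entries.get(block)
        return entry is None or entry.counter == 0
```

For example, after two calls to `leading_load(0x40)`, `can_replace(0x40)` is false and `request_invalidate(0x40)` returns false while setting the invalidation field. A further `leading_load(0x40)` then reports a stall until two calls to `trailing_load(0x40)` return the counter to zero, at which point the block may be invalidated or replaced.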