1. Field of the Invention
The present invention relates to computer microprocessors, and specifically to a method and apparatus by which a microprocessor transfers temporary speculative state to the user visible architectural state as instructions commit.
2. History of the Prior Art
Computer microprocessors are programmed with the assumption that each instruction completes and updates the user visible state of the processor (typically comprised of a plurality of registers and memory), also known as the architectural state, before the next instruction in the program executes. When instructions appear to the programmer to have executed in their original program order in this manner, the processor is said to exhibit sequential semantics.
To increase efficiency, modern microprocessors rearrange instructions out of program order when executing them, for instance to avoid stalling while waiting for an external memory access to complete, or to allow more than one instruction to execute at once. The process of executing an instruction is also referred to as issuing it. Processors of the prior art typically dynamically schedule instructions out of order using hardware structures, such that a given instruction will only issue after all results it depends on have been generated. Processors may also issue instructions speculatively, such that instructions may issue before it is known if their execution is actually required (for instance, if the instruction resides along the path actually taken by a branch). Instructions may generate exceptions (for instance, by accessing an invalid memory address). The processor state comprised of results generated by instructions speculatively issued out of order is called the speculative state.
To preserve sequential semantics, the speculative state generated by a given instruction must not update the architectural state until it is known with certainty that the instruction should actually have been executed (i.e., it was executed along the path of branches actually followed by the program, and it generated no exceptions). If the architectural state is updated prematurely, it will be impossible to recover from branch mispredictions, mis-speculations and exceptions, as the architectural state will have been corrupted by invalid data. Typically, microprocessors achieve sequential semantics by requiring all instructions to commit to the architectural state (i.e., update registers in the architectural register file and memory in the processor's caches) in their original program order, even if they actually issued out of order so as to complete faster. This ensures that the sequence of updates is identical to that generated by a processor executing all instructions in program order. This in-order commit is typically achieved using a reorder buffer (ROB), a structure familiar to those skilled in the art. Results are written to the ROB in the order in which they are generated but are read out and committed strictly in program order, as if reading from a queue.
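By way of illustration only, the queue-like behavior of a reorder buffer may be sketched as follows. This is a minimal software model, not a description of any particular hardware; the names ReorderBuffer, allocate, complete and commit are hypothetical and are introduced solely for this example.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: results arrive in any order but commit in program order."""

    def __init__(self):
        self.entries = deque()  # entries held in program order
        self.arch_regs = {}     # architectural register file (assumed model)

    def allocate(self, dest):
        """Reserve an entry in program order; returns the entry."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result whenever the instruction finishes executing."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished entries from the head only; stop at the first
        entry that is not yet done."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob = ReorderBuffer()
load = rob.allocate("r1")  # oldest instruction, e.g. a slow load
add = rob.allocate("r2")   # younger instruction
rob.complete(add, 7)       # the younger instruction finishes first
first = rob.commit()       # nothing commits: the head is not done
rob.complete(load, 42)
second = rob.commit()      # both commit, oldest first
```

In this sketch the younger result for r2 is held in the buffer and cannot reach the architectural registers until the older load has completed.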
The requirement that the results of instructions be committed to the architectural state strictly in program order is undesirable for several reasons. First, if the result of a given instruction is not ready, this instruction and all instructions after it in program order must wait for the not ready instruction to complete before the commitment process can continue. This constrains the throughput of the processor when the not ready instruction is, for instance, a load from memory, which may take a very long time to complete.
Second, the results of many instructions must be retained within the processor until they commit in program order, even if it is known that those results will never be used again by future instructions. This often greatly increases the internal resource requirements of the processor (for instance, physical registers, reorder buffers, store buffers and other structures known to those skilled in the art), increasing its complexity, decreasing performance and wasting electrical power.
Some microprocessor designs do not enforce sequential semantics by requiring instructions to commit strictly in program order. Instead, these designs use the concept of a trace, a sequence of instructions along a frequently executed and/or predicted path through the user program. Traces are comprised of a plurality of instructions including one or more operations that may change the control flow (path of execution through the program) and/or violate assumptions made in generating the trace, such as by causing an exception. These operations may include but are not limited to conditional branches, memory barrier operations, loads and stores that may cause memory related exceptions, et cetera. Instructions may be freely scheduled out of program order and/or executed speculatively within each trace so as to maximize performance, even if those instructions could cause exceptions or are along speculatively predicted branch paths, as will be appreciated by those skilled in the art. The Intel Pentium 4 microprocessor is an example of a design that arranges instructions into traces in the manner described; the instructions within each trace are then dynamically scheduled out of program order.
To ensure that speculative results do not contaminate the architectural state until they can be verified as correct, traces of the prior art typically have atomic semantics: at the successful completion of a trace (variously known as a commit point or checkpoint), all updates to the speculative state are simultaneously used to update the architectural state in one atomic operation. However, if any operation within the trace causes an exception or is found to be on the wrong branch path, the entire trace incurs a rollback, in which the speculative state is discarded and the processor returns to the last known good architectural state present before executing the trace. The processor then recovers from the rollback by performing an implementation specific recovery procedure, such as by executing each operation in its original program order until the excepting instruction is found or the correct branch path is resolved. A variety of methods may be used to separate the speculative architectural state from the committed last known good architectural state, and to update the committed state in one atomic operation. These methods are known from the prior art, for instance U.S. Pat. No. 5,958,061 (E. Kelly et al. Host microprocessor with apparatus for temporarily holding target processor state, September 1999) and U.S. Pat. No. 6,011,908 (M. Wing et al. Gated store buffer for an advanced microprocessor, January 2000).
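The all-or-nothing behavior of an atomic trace may be illustrated by the following minimal sketch. It assumes that the architectural state can be modeled as a dictionary and that speculative updates are held in a shadow copy; it is not the mechanism of the patents cited above. The names TraceFault, run_atomic_trace, set_reg and fault are hypothetical.

```python
class TraceFault(Exception):
    """Hypothetical signal for an exception or wrong-path operation."""


def run_atomic_trace(arch_state, trace):
    """Run every operation against a speculative copy; commit all or nothing."""
    speculative = dict(arch_state)  # shadow copy holds speculative updates
    try:
        for op in trace:
            op(speculative)
    except TraceFault:
        return False                # rollback: the speculative copy is discarded
    arch_state.clear()
    arch_state.update(speculative)  # commit point: publish all updates together
    return True


def set_reg(name, value):
    """Build an operation that writes one register in the given state."""
    def op(state):
        state[name] = value
    return op


def fault(state):
    """An operation that always faults, standing in for a mispredict."""
    raise TraceFault("mispredicted branch or memory exception")


arch = {"r1": 0, "r2": 0}
ok = run_atomic_trace(arch, [set_reg("r1", 5), set_reg("r2", 6)])
after_commit = dict(arch)
bad = run_atomic_trace(arch, [set_reg("r1", 99), fault, set_reg("r2", 100)])
```

After the second call, the write of 99 to r1 that preceded the fault is not visible in arch; the state remains as it was at the last commit point. The recovery procedure that follows a rollback is implementation specific and is not modeled here.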
An alternative paradigm in microprocessor design using the trace concept, called binary translation, takes a different approach to out of order execution. In a binary translation system, traces of instructions for a user instruction set are transparently translated to a different native instruction set composed of micro-operations (uops). These native code traces are then scheduled out of program order to improve performance and executed on simpler and faster processor hardware than would be possible if the hardware had to directly support the execution of user instructions. Each translated and scheduled trace is saved in a translation cache for immediate reuse at a later time in lieu of retranslating and rescheduling the trace every time it is encountered.
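The reuse of translated traces may be sketched as a simple lookup table. The sketch assumes, for illustration only, that each trace is identified by its starting user address and that translation is a pure function of the user instructions; TranslationCache, toy_translate and the address 0x1000 are hypothetical.

```python
class TranslationCache:
    """Toy translation cache: translate a trace once, reuse it afterwards."""

    def __init__(self, translate):
        self.translate = translate  # hypothetical translator: user trace -> uops
        self.cache = {}             # keyed by starting user address (assumption)
        self.misses = 0

    def lookup(self, start_addr, user_instructions):
        """Return the cached native trace, translating only on a miss."""
        if start_addr not in self.cache:
            self.misses += 1
            self.cache[start_addr] = self.translate(user_instructions)
        return self.cache[start_addr]


def toy_translate(user_instructions):
    """Stand-in translator: one tagged uop per user instruction."""
    return tuple("uop:" + insn for insn in user_instructions)


tcache = TranslationCache(toy_translate)
t1 = tcache.lookup(0x1000, ["add", "load"])  # miss: translated and stored
t2 = tcache.lookup(0x1000, ["add", "load"])  # hit: the stored trace is reused
```

A real system would also need to bound the size of the cache and invalidate entries when the underlying user code changes; neither concern is modeled in this sketch.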
Typically the native hardware is in the form of a VLIW (Very Long Instruction Word) microprocessor core, which executes multiple independent uops per cycle by bundling them together and issuing one bundle per clock cycle. The VLIW processor core generally must be presented with a stream of uops already statically scheduled into bundles before execution; it does not dynamically reorder operations as they are encountered, as in a traditional out of order superscalar processor. The process of translating and/or scheduling uops into traces is typically done by a software layer written for the native uop instruction set; however, this layer may also be implemented in a combination of hardware and/or software, as is described in U.S. Pat. No. 6,216,206 (G. Peled et al. Trace Victim Cache) and U.S. Patent Application 20030084270 (B. Coon et al. System and method for translating non-native instructions to native instructions for processing on a host processor, May 2003). In most microprocessors using binary translation in the context of a VLIW processor core, each trace is fully translated and statically scheduled before its first execution.
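Static scheduling of uops into bundles may be illustrated by a simple greedy scheduler. The sketch below is one possible approach under simplifying assumptions (unit latency for every uop, a fixed bundle width, and only true data dependencies considered); it is not the scheduler of any cited reference. The function schedule_bundles and the uop names are hypothetical.

```python
def schedule_bundles(uops, width):
    """Greedily pack uops into bundles of at most `width` uops.

    uops: list of (name, deps) in program order, where deps holds the
    indices of earlier uops whose results this uop consumes.
    Returns a list of bundles, each a list of uop names.
    """
    bundles = []
    placed = []  # placed[i] is the bundle index chosen for uop i
    for name, deps in uops:
        # A uop may not issue before the bundle after its latest producer.
        slot = max((placed[d] + 1 for d in deps), default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        # Open new bundles as needed.
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        placed.append(slot)
    return bundles


uops = [
    ("load r1", []),
    ("load r2", []),
    ("add r3=r1+r2", [0, 1]),
    ("load r4", []),
    ("mul r5=r3*r4", [2, 3]),
]
bundles = schedule_bundles(uops, width=2)
```

With a width of two, the independent load of r4 shares a bundle with the add rather than waiting for its own cycle, and the five uops occupy three bundles.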
In a VLIW-based microprocessor using binary translation, atomic traces are typically implemented by encoding the final VLIW bundle in a given trace such that all speculative results accumulated during the execution of the trace are atomically committed to the architectural state at the time the final bundle completes execution.
Atomic traces can also be used in dynamically scheduled out of order processors. In Out-of-Order Commit Processors (A. Cristal et al., Intl. Symposium on High Performance Computer Architectures 2004), a mechanism is disclosed which allows results to commit out of program order. In this scheme, the architectural state is checkpointed at poorly predictable branches, and physical resources (physical registers, store buffers, et cetera) corresponding to a given result are freed when the corresponding architectural destination is overwritten in program order and when all known consumers of that result have issued (i.e., the result is said to be “dead”). H. Akkary et al. (Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors. IEEE Intl. Symposium on Microarchitecture 2003) present a similar approach to that of Cristal et al. but use different mechanisms, including the use of counters to track how many operations within each checkpoint are waiting to commit. Martinez et al. (Cherry: Checkpointed Early Resource Recycling in Out-of-Order Microprocessors, IEEE Intl. Symposium on Microarchitecture 2002) present another checkpointing approach using shadowed architectural registers and a transactional data cache, similar to the '061 and '908 patents cited above. Hwu et al. (Checkpoint repair for high-performance out-of-order execution machines. IEEE Trans. on Computers 1987) present an overview of checkpointing techniques predating the above work.
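The general idea of tracking outstanding operations per checkpoint with a counter may be sketched as follows. This is an illustrative model only and is not taken from the cited publications; it assumes that a checkpoint may commit once it has been closed by a younger checkpoint and its counter has reached zero. The class CheckpointTracker and its methods are hypothetical.

```python
from collections import deque


class CheckpointTracker:
    """Toy model: one pending-operation counter per open checkpoint."""

    def __init__(self):
        self.pending = deque()  # counters for open checkpoints, oldest first
        self.committed = 0      # number of checkpoints already committed

    def open_checkpoint(self):
        """Start a new (youngest) checkpoint with no pending operations."""
        self.pending.append(0)

    def dispatch(self):
        """Add an operation to the youngest checkpoint; returns its checkpoint id."""
        self.pending[-1] += 1
        return self.committed + len(self.pending) - 1

    def complete(self, checkpoint_id):
        """Mark one operation of the given checkpoint as finished."""
        self.pending[checkpoint_id - self.committed] -= 1

    def try_commit(self):
        """Commit the oldest checkpoints that are closed and fully complete.

        Returns the number of checkpoints committed by this call.
        """
        count = 0
        while len(self.pending) > 1 and self.pending[0] == 0:
            self.pending.popleft()
            self.committed += 1
            count += 1
        return count


tracker = CheckpointTracker()
tracker.open_checkpoint()  # checkpoint 0
a = tracker.dispatch()
b = tracker.dispatch()
tracker.open_checkpoint()  # checkpoint 1 closes checkpoint 0
c = tracker.dispatch()
tracker.complete(c)        # a younger operation finishes first
n0 = tracker.try_commit()  # checkpoint 0 still has two pending operations
tracker.complete(b)        # operations within a checkpoint finish in any order
tracker.complete(a)
n1 = tracker.try_commit()  # checkpoint 0 now commits as a unit
```

In this sketch no per-operation ordering is kept inside a checkpoint; only the count of outstanding operations determines when the whole span between two checkpoints may commit.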
Those skilled in the art will notice that the concept of a trace (atomic or otherwise) should not be confused with a thread. In the prior art, a thread comprises the flow of execution through a program as seen from the viewpoint of a single context (typically comprised of registers and memory locations). Multiple threads may be executed in parallel on a multi-threaded microprocessor providing a plurality of hardware contexts, or may be time-sliced by the operating system into a single hardware context. Unlike traces, threads are not bound to execute a specific subset of instructions from the program, and hardware threads may exist perpetually from the moment the processor is powered on. Certain microprocessors, such as the Intel Pentium 4, simultaneously utilize both traces and threads, wherein the continuous stream of instructions comprising each thread is itself divided into a plurality of traces at certain basic block boundaries, as defined above. In the Pentium 4, the instructions comprising each trace are decoded (as in binary translation) and written into a trace cache memory buffer prior to execution of each trace. Each trace is then executed by reading it from the trace cache and dynamically scheduling the constituent instructions out of program order. However, in the embodiment used in the Pentium 4 and similar designs, the traces are non-atomic, since every instruction within each trace is still committed in program order, rather than committing all or part of the trace at a single atomic checkpoint as described previously.
In all these approaches, the span of operations between any two checkpoints is considered an atomic trace and incurs a full rollback on any mispredict or exception, unlike the present invention. Additionally, even if a given result is dead, it must still occupy physical resources (i.e., registers and store buffers) within the processor core until its corresponding architectural destination is overwritten in program order. Furthermore, while operations from several checkpoints may be in the pipeline at any given time in the cited approaches, they cannot be intermixed such that they may fully execute and commit in parallel as with the present invention.