Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many PC networks, mainframes and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems and mainframe applications. Computers provide individuals and businesses with a host of software applications including word processing, spreadsheet, accounting, e-mail, voice over Internet protocol telecommunications, and facsimile.
Users of digital processors such as computers continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. In addition, processor speed has increased much more quickly than the speed of main memory access. As a result, cache memories, or caches, are often used in many such systems to increase performance in a relatively cost-effective manner. Many modern computers also support “multi-threading,” in which two or more programs, or threads of programs, are run in alternation in the execution pipeline of the digital processor. Thus, multiple program actions can be processed concurrently using multi-threading.
Another method processor architects utilize to increase the performance of their designs is to increase the processor's clock frequency. For a given technology, a higher frequency allows for more cycles of work to be done within a unit of time. One impact of this approach is that the amount of logic that may be traversed in a single processor cycle is reduced. Therefore, a corresponding reduction in complexity of the design is required to maximize the frequency. Another impact of high frequency designs is that, as clock frequencies increase, the time it takes for signals to travel across a VLSI chip can become significant, such that it may take many processor cycles for a signal to travel from one element of the chip to another.
Most modern computers include at least a first level cache L1 and typically a second level cache L2. This dual cache memory system enables storing frequently accessed data and instructions close to the execution units of the processor to minimize the time required to transmit data to and from memory. The L1 cache is typically contained within the processor core near the execution units. The L2 cache is typically kept physically close to the processor core. Ideally, as the time for execution of an instruction nears, instructions and data are moved to the L2 cache from a more distant memory. When execution of the instruction is imminent, the instruction and its data, if any, are advanced to the L1 cache.
As the processor operates in response to a clock, an instruction fetcher accesses instructions from the L1 cache. A cache miss occurs if the instructions sought are not in the cache when needed. The processor would then seek the instructions in the L2 cache. A cache miss may occur at this level as well. The processor would then seek the instructions from other memory located further away. Thus, each time a memory reference occurs which is not present within the first level of cache, the processor attempts to obtain that memory reference from a second or higher level of memory. When an instruction cache miss occurs, the instruction fetcher suspends its execution of the instruction stream while awaiting retrieval of the instruction from system memory. In a multi-threaded processor, the instruction fetcher may operate on another thread of instructions while awaiting the retrieval of the instruction. The processor execution units may still be operating on previous elements of the instruction stream, or may be operating on another thread of instructions. The instruction fetcher may also begin to initiate additional requests for instruction data from the memory hierarchy based on the instruction stream that missed the cache.
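The lookup sequence described above can be sketched in simplified form. The following is a minimal, illustrative model only, assuming each cache level behaves as a simple address-to-instruction mapping; the function name and dict-backed levels are assumptions for illustration, not part of any real processor design.

```python
# Hypothetical sketch of a multi-level instruction lookup: probe L1,
# then L2, then more distant system memory, filling the caches on a miss.

def fetch(address, l1, l2, system_memory):
    """Return (level_found, instruction), probing L1, then L2, then memory."""
    if address in l1:
        return "L1", l1[address]
    if address in l2:                 # L1 miss: probe the second-level cache
        l1[address] = l2[address]     # fill L1 on the way back
        return "L2", l2[address]
    # L2 miss: retrieve from more distant memory, filling both caches
    instr = system_memory[address]
    l2[address] = instr
    l1[address] = instr
    return "memory", instr

l1, l2 = {}, {0x100: "add r1,r2"}
memory = {0x100: "add r1,r2", 0x104: "ld r3,0(r1)"}
print(fetch(0x104, l1, l2, memory))  # misses both caches: found in memory
print(fetch(0x104, l1, l2, memory))  # now hits in L1
```

The fills on the miss path mirror the ideal movement described earlier: data migrates from distant memory toward the L1 cache as it is referenced.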
A common architecture for high performance microprocessors includes the ability to execute one or more instructions on each clock cycle of the machine. Execution units of modern processors therefore have multiple stages forming an execution pipeline. On each cycle of processor operation, each stage performs a step in the execution of an instruction. Thus, as a processor cycles, an instruction is executed as it advances through the stages of the pipeline.
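The stage-per-cycle behavior can be illustrated with a toy model. This is a sketch under simplifying assumptions: a fixed four-stage pipeline with no stalls or hazards, and stage names chosen for illustration only.

```python
# Minimal sketch of instructions advancing one pipeline stage per clock
# cycle; every name here is illustrative.

STAGES = ["fetch", "decode", "execute", "writeback"]

def run_pipeline(instructions, cycles):
    """Advance each instruction one stage per cycle; return those completed."""
    pipeline = [None] * len(STAGES)   # one slot per stage
    completed = []
    pending = list(instructions)
    for _ in range(cycles):
        done = pipeline.pop()         # instruction leaving the final stage
        if done is not None:
            completed.append(done)
        pipeline.insert(0, pending.pop(0) if pending else None)
    return completed

# Once the pipeline fills, one instruction completes per cycle: after
# six cycles the first two of three instructions have completed.
print(run_pipeline(["i0", "i1", "i2"], 6))  # -> ['i0', 'i1']
```

The point of the model is the overlap: while one instruction is in writeback, younger instructions occupy the earlier stages, so throughput approaches one instruction per cycle even though each instruction spends several cycles in flight.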
In a superscalar architecture, the processor comprises multiple special purpose execution units to execute different instructions in parallel. A dispatch unit rapidly distributes a sequence of instructions to different execution units. For example, a load instruction may be sent to a load/store unit and a subsequent branch instruction may be sent to a branch execution unit. The branch instruction may complete execution at an earlier stage in the pipeline than the load instruction even though the load instruction originally preceded the branch instruction. This is so because more stages may be required to execute the load instruction than to execute the branch instruction. Additionally, instructions may execute at a variable stage in the processor pipeline depending on inter-instruction dependencies and other constraints.
In a superscalar architecture, instructions may be completed in-order or out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as a set of predefined rules is satisfied. Microprocessors may support varying levels of out of order execution support, meaning that the ability to identify and execute instructions out of order may be limited. One major motivation for limiting out of order execution support is the enormous amount of complexity that is required to identify which instructions can execute early, and to track and store the out of order results.
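The in-order completion rule above reduces to a simple check. The sketch below is illustrative only; the instruction names and list-based program order are assumptions made for the example.

```python
# Hedged sketch of the in-order completion rule: an instruction may
# complete only when every instruction dispatched ahead of it has completed.

def can_complete_in_order(completed, program_order, instr):
    """True if all instructions older than `instr` have already completed."""
    older = program_order[:program_order.index(instr)]
    return all(i in completed for i in older)

order = ["load", "branch", "add"]
completed = set()
# Out of order, the branch might finish executing first; in-order
# completion nevertheless holds it back until the older load completes.
print(can_complete_in_order(completed, order, "branch"))  # False
completed.add("load")
print(can_complete_in_order(completed, order, "branch"))  # True
```

An out-of-order completion policy would relax this check, allowing the branch to complete early provided its predefined safety rules were satisfied, at the cost of the tracking complexity described above.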
Additional complexities arise when instructions executed out of order are determined to be incorrect per the in-order execution model: when an older instruction causes an exception, the younger instructions already executed must not be allowed to impact the state of the processor. As processor speeds continue to increase, it becomes more attractive to eliminate some of the complexities associated with out of order execution. This will eliminate logic (and its corresponding chip area, or “real estate”) from the chip which is normally used to track out of order instructions, thereby allowing additional “real estate” to become available for use by other processing functions. The reduction in complexity may also allow for a higher frequency design.
Modern processor architectures also include an instruction fetcher that fetches instructions from the L1 instruction cache. The instruction fetcher will send instructions to a decode unit and an instruction buffer. The dispatch unit receives instructions from the instruction buffer and dispatches them to the execution units. When the instruction fetcher receives a branch instruction, the instruction fetcher may predict whether the branch is taken and select a corresponding instruction path to obtain instructions to pass to the instruction buffer. When the branch instruction is executed in an execution unit, the processor can then determine whether the predicted instruction path was correct. If not, the processor redirects the instruction fetcher to the correct instruction address and flushes the instruction buffer and pipeline of instructions younger than the branch instruction.
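The redirect-and-flush behavior on a branch mispredict can be sketched as follows. This is a minimal model under assumed conventions: instruction buffer entries are (sequence number, address) pairs, and a lower sequence number means an older instruction; all names are illustrative.

```python
# Illustrative sketch of branch resolution: on a mispredict, flush the
# entries younger than the branch and redirect the fetcher.

def resolve_branch(buffer, branch_seq, predicted_taken, actual_taken,
                   correct_target):
    """Return (redirect_address, surviving_buffer).

    redirect_address is None when the prediction was correct; otherwise it
    is the address the instruction fetcher must be redirected to.
    """
    if predicted_taken == actual_taken:
        return None, buffer                            # prediction correct
    # Mispredict: keep the branch and older entries, flush younger ones.
    survivors = [(seq, addr) for (seq, addr) in buffer if seq <= branch_seq]
    return correct_target, survivors

buf = [(1, 0x10), (2, 0x14), (3, 0x18)]   # (sequence number, address)
redirect, buf = resolve_branch(buf, branch_seq=2, predicted_taken=False,
                               actual_taken=True, correct_target=0x40)
print(hex(redirect), buf)   # entry (3, 0x18), younger than the branch, is gone
```

A real design would flush the execution pipeline as well as the buffer; the sketch shows only the buffer side of that action.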
The instruction buffer that receives instructions from the instruction fetcher may comprise an instruction recirculator to re-introduce instructions into the pipeline when an instruction has already been dispatched, but is unable to execute successfully at the time it reaches a particular stage in the pipeline. In this case, stalling the instruction in the pipeline until execution is possible may introduce significant complexities associated with coordinating the stalling action, especially in a superscalar architecture where various execution pipelines may be impacted by a stall. Additionally, in a multi-threaded processor, stalling an execution pipeline may consume execution resources that could be utilized by another thread. For these and other reasons, it is often desirable to recirculate an instruction from the instruction buffer instead. For example, at a stage of execution of a load instruction, the data called for by the instruction may not be in the L1 data cache. Execution of the instruction then becomes stalled and the instruction is said to be rejected. When an instruction is rejected, it can be re-sent from the instruction buffer to the execution units for execution once the data it calls for has been retrieved. In many cases, though, the condition that prevents successful execution is such that the instruction will be likely to execute successfully if re-executed as soon as possible. For example, an L1 data cache may have multiple sets of data, each of which may contain the data sought by a load instruction. When a load instruction executes, many processors utilize a mechanism of set prediction under which the load will choose a particular subset of the available sets to check for the data. If the set prediction is incorrect, the set predictor is updated, and the load must be re-executed to obtain data from the correct set.
In this, and many other cases, it is desirable for the rejected instruction to be re-introduced to the execution units by the instruction buffer as quickly as possible.
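The set-prediction example can be made concrete with a small sketch. This is an illustrative model only, assuming each cache set is a simple address-to-value mapping and the set predictor is a per-address table; every name here is hypothetical.

```python
# Illustrative model of set prediction for a load: probe only the
# predicted set; on a wrong guess, train the predictor and reject the
# load so the instruction buffer can recirculate it for re-execution.

def execute_load(address, sets, set_predictor):
    """Return (hit, value); on a set mispredict, update the predictor
    and reject (the load must be re-executed from the instruction buffer)."""
    predicted = set_predictor.get(address, 0)
    if address in sets[predicted]:
        return True, sets[predicted][address]     # predicted set was right
    for idx, ways in enumerate(sets):
        if address in ways:
            set_predictor[address] = idx          # train the set predictor
            break
    return False, None                            # reject: re-execute the load

sets = [{}, {0x20: 99}]          # the data actually resides in set 1
predictor = {}
print(execute_load(0x20, sets, predictor))   # (False, None): load rejected
print(execute_load(0x20, sets, predictor))   # (True, 99): re-execution hits
```

Because the predictor is corrected before the reject is signaled, the recirculated load succeeds immediately on re-execution, which is why rapid re-introduction from the instruction buffer is valuable here.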
In a processor with limited out-of-order facilities, an instruction reject may require the re-execution of subsequent instructions as well as the rejected instruction itself, since the results of younger instructions may be required to be discarded. In this case the instruction buffer will re-read the rejected instruction and subsequent instructions so that they may be re-executed. When one of these younger instructions is a branch instruction that executes and was mispredicted, the design may require complex circuitry to handle both the instruction reject and the branch mispredict flush when they occur in close proximity, or when the branch mispredict flush occurs after an instruction reject. Designs may therefore take steps to avoid these complexities, such as by suppressing branch execution for instructions younger than a reject.
However, as noted above, in a high frequency design, it may take many cycles for signals to travel between units within the processor. When an instruction is rejected, the reject indication may take multiple cycles before it reaches the branch execution unit. Because branch instructions may complete execution coincident with, or prior to, an older instruction that requires more stages for execution, there may be multiple younger branch instructions that are executed before an older rejected instruction can signal the branch execution unit to suppress execution. Therefore, the complexities associated with an instruction reject and a branch misprediction flush in close proximity are exacerbated, since multiple branch instructions may execute even after an older instruction has rejected. These complexities can be a major problem for high frequency designs.