State-of-the-art computer processors employ a variety of techniques to accelerate performance. One such technique, referred to as pipelining, permits a processor to execute more than one instruction at a given time by executing instructions in multiple stages. For example, an initial stage may fetch instructions from memory, a second stage may decode instructions, a third stage may locate instruction operands, and so forth. Since each stage is able to operate on a different instruction, multiple instructions can be executed at the same time, thus shortening the apparent time to execute any given instruction. The goal is for each stage to complete all associated operations on an instruction in a single clock cycle, such that instructions continuously advance to the next pipeline stage and an instruction completes execution each clock cycle. An extension of the pipelining concept, referred to as superpipelining, provides for pipeline stages to also have a pipelined (i.e. staged) structure comprising several sub-stages, and provides further opportunity for enhancing processor performance.
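The throughput benefit described above can be illustrated with a short calculation. The following is a minimal sketch only, assuming an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; the function names are hypothetical and are not taken from any particular processor.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for a stall-free pipeline: fill the pipeline once, then
    complete one instruction per clock cycle."""
    if num_instructions <= 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction passes through every stage before
    the next instruction is allowed to start."""
    if num_instructions <= 0:
        return 0
    return num_instructions * num_stages
```

Under these assumptions, 100 instructions on a five-stage pipeline take 104 cycles rather than 500, which approaches the stated goal of one completed instruction per clock cycle.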
Certain events may prevent a pipeline stage from completing its operations in a single clock cycle. One such event is the occurrence of a change of flow (COF) instruction, for example a branch, jump or call/return instruction. A branch instruction requires a branch to be taken or not taken, depending on whether predetermined conditions are met. Jump and call/return instructions are unconditional (always taken). Taken branches and unconditional change of flow instructions interrupt the instruction stream (e.g. "stall" the pipeline) to cause instruction fetch to proceed from a new target instruction, and thus have a potential to slow performance.
Another event which may cause a pipeline stall is an exception. An exception is an interruption in program flow resulting from the execution of instructions, for example, a floating point overflow.
One technique used to avoid or minimize such pipeline stalls is speculative execution. Speculative execution, also referred to as out-of-order execution, refers to a practice of allowing instructions to be executed out of their original programmed sequence on the speculation that the result will be needed. For example, in the case of an instruction sequence including a branch operation, the processor may predict whether the branch will be taken or not taken and speculatively continue to execute instructions based on the prediction. Similarly, in the case of an instruction sequence including a floating point instruction, instructions which follow the floating point instruction may be allowed to advance prior to the completion of the floating point instruction. If subsequent events indicate that the speculative instruction should not have been executed (e.g. the branch prediction was incorrect, or a floating point error occurred), the processor abandons any result the instruction produced and returns execution to the point before speculative execution occurred. The rationale behind out-of-order execution is that the increase in efficiency realized by avoiding pipeline stalls outweighs the loss of efficiency due to the redundant execution which may be needed to clean up after false starts.
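The abandon-and-return behavior described above can be sketched in software. This is an illustrative model only, assuming that processor state is a simple dictionary of register values and that each path is a list of (destination, value) writes; the names `apply_path` and `execute_branch` are hypothetical.

```python
def apply_path(registers, path):
    """Return a copy of `registers` after applying (dest, value) writes in order."""
    result = dict(registers)
    for dest, value in path:
        result[dest] = value
    return result


def execute_branch(registers, predicted_taken, actually_taken,
                   taken_path, not_taken_path):
    """Speculate down the predicted path; on a misprediction, abandon the
    speculative results and re-execute from the saved checkpoint.

    Returns (resulting register state, whether speculative work was discarded).
    """
    checkpoint = dict(registers)  # state at the point before speculation
    predicted_path = taken_path if predicted_taken else not_taken_path
    speculative = apply_path(checkpoint, predicted_path)

    if predicted_taken == actually_taken:
        return speculative, False  # prediction correct: keep the results

    # Prediction incorrect: discard `speculative`, restart from the checkpoint.
    correct_path = taken_path if actually_taken else not_taken_path
    return apply_path(checkpoint, correct_path), True
```

In this sketch the cost of a misprediction is the second call to `apply_path`, which corresponds to the redundant execution mentioned above.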
In processors employing out-of-order execution, there is normally provided a completion buffer or other similar means which is used to manage the results of the out-of-order execution so that the instructions appear to be executed sequentially. The completion buffer may use instruction tags which are numbers assigned to instructions at the start of execution, and which are carried through execution to track program order. When an instruction finishes execution, the results are stored in a temporary buffer while the completion buffer keeps track of the program sequence. When the completion buffer determines that a particular instruction is the next in the programmed sequence, it allows the results of the instruction stored in the temporary buffer to be stored permanently (e.g. in designated registers or memory).
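The tag-based bookkeeping described above may be sketched as follows. This is a simplified software model, not a description of any actual hardware: it assumes tags are consecutive integers assigned at dispatch, and the class and method names (`CompletionBuffer`, `dispatch`, `finish`) are hypothetical.

```python
class CompletionBuffer:
    """Commits results in program order even when instructions finish out of order."""

    def __init__(self):
        self.next_tag = 0        # tag for the next instruction entering execution
        self.next_to_commit = 0  # oldest tag whose result is not yet permanent
        self.pending = {}        # temporary buffer: tag -> (destination, value)
        self.registers = {}      # permanent (architected) state

    def dispatch(self) -> int:
        """Assign a tag to an instruction at the start of execution."""
        tag = self.next_tag
        self.next_tag += 1
        return tag

    def finish(self, tag: int, dest: str, value: int) -> None:
        """Record a finished result, then commit every result that is now
        next in the programmed sequence."""
        self.pending[tag] = (dest, value)
        while self.next_to_commit in self.pending:
            d, v = self.pending.pop(self.next_to_commit)
            self.registers[d] = v
            self.next_to_commit += 1


cb = CompletionBuffer()
t0, t1, t2 = cb.dispatch(), cb.dispatch(), cb.dispatch()
cb.finish(t2, "r3", 30)  # finishes first, but is held in the temporary buffer
cb.finish(t0, "r1", 10)  # oldest instruction: committed immediately
cb.finish(t1, "r2", 20)  # commits, and releases the held result for t2
```

A result is held in `pending` until every older instruction has committed, so the permanent state only ever changes in program order.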
Another technique used to accelerate performance of a processor involves memory addressing. The use of real addresses and virtual addresses is known in the art. Real addresses, also referred to as physical addresses, represent actual physical locations in memory. Physical memory is generally divided into relatively large blocks referred to as segments, which are in turn divided into smaller blocks referred to as pages. Virtual addresses are temporary addresses employed as a convenience for programming so that the programmer need not keep track of which physical addresses are in use and which are available; instead, the processor keeps track of the availability of each physical address as well as the correspondence between virtual and physical addresses. A variety of schemes for translating between virtual addresses and real addresses are also known. Normally, one or more tables of virtual addresses (or portions thereof) and the corresponding real addresses (or portions thereof) are stored in memory. Translation schemes may involve a combination of calculation steps along with segment table searching and page table searching (e.g. a "page table walk"). Since segment and page table searching can be quite time consuming, it is known in the art to store a number of recently used entries from the segment and page tables in separate smaller-sized buffers (e.g. a segment lookaside buffer (SLB) and/or a translation lookaside buffer (TLB)), so as to decrease access time for addresses that are frequently used.
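The role of a lookaside buffer in front of a slower table search can be sketched as follows. The model is deliberately simplified and rests on several assumptions not stated above: a single-level page table held in a dictionary, a 4096-byte page size, no segment translation, and a least-recently-used replacement policy. The names `TranslationUnit`, `translate` and `walks` are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class TranslationUnit:
    """Translates virtual addresses, caching recent translations in a small TLB."""

    def __init__(self, page_table: dict, tlb_entries: int = 2):
        self.page_table = page_table  # virtual page number -> real page number
        self.tlb = OrderedDict()      # recently used translations, oldest first
        self.tlb_entries = tlb_entries
        self.walks = 0                # number of slow table searches performed

    def translate(self, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.tlb:
            self.tlb.move_to_end(vpage)       # hit: mark as most recently used
            rpage = self.tlb[vpage]
        else:
            self.walks += 1                   # miss: stands in for a table walk
            rpage = self.page_table[vpage]
            self.tlb[vpage] = rpage
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)  # evict the least recently used entry
        return rpage * PAGE_SIZE + offset
```

Repeated accesses to the same page are served from `tlb` without incrementing `walks`, which is the access-time saving the lookaside buffers are intended to provide.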
In a pipelined processor employing speculative execution, speculative execution of storage related instructions, such as LOAD and STORE instructions, can create problems. A LOAD instruction receives an operand from a target address in cache or memory, while a STORE instruction writes an operand to a target address in cache or memory.
FIG. 1 depicts a block diagram of a relevant portion of an exemplary prior art pipelined processor for speculatively executing LOAD/STORE instructions. More particularly, shown in FIG. 1 are a processor core 10, portions of which are further described below, and a cache and memory management unit (CMMU) 12. The processor core 10 executes instructions while the cache and memory management unit 12 controls access to cache and memory. Included in the processor core 10 are at least one load store unit (LSU) 14 (two are shown as 14a and 14b) for executing LOAD and STORE instructions, a completion buffer 16 for managing the results of out-of-order execution driven by the LSUs 14a and 14b and other execution units, and a writeback unit 18, which is a temporary buffer for holding pending store operands. Included in the cache and memory management unit 12 are a cache 20 and its associated tag directory, a cache controller 21, a memory management unit 22, and a translation unit 24 for translating between real and virtual addresses. The translation unit 24 may include, for example, a translation lookaside buffer (TLB) 26 and/or a segment lookaside buffer (SLB) 28 for accelerating translation of virtual addresses to real addresses as described above. The translation unit 24 also normally includes circuitry for translating addresses not found in the TLB 26 and SLB 28 (e.g. circuitry for conducting a page table walk).
The prior art system shown in FIG. 1 might execute a STORE instruction as follows. The execution is conceptually broken into three pipelined stages (not shown as distinct hardware stages). The first stage is an arbitration stage (ARB) in which the cache controller 21 determines for each clock cycle which outstanding cache request will be serviced in the current clock cycle. An instruction cannot complete the ARB stage unless both the cache 20 and the TLB 26/SLB 28 are available. In the second stage, or the access stage (ACC), the cache and memory management unit 12 performs virtual memory address translation to form a real address by means of the TLB 26 and/or the SLB 28 and also checks for protection violations (a protection violation results when an access request is inconsistent with a security classification assigned to a particular portion of memory). The resultant real address is then used by the CMMU 12 to interrogate the cache tag directory for the translated address. If there is a "hit" in the cache 20 and in the TLB 26 and SLB 28 (i.e. the translated address is found and is not write protected), the cache 20 and/or external memory (not shown) is written with the data (STORE). If the real address is not found in the TLB/SLB 26/28, a miss signal is returned to the processor core 10. A third stage (MISS stage) handles misses arising during the access stage. That is, a full translation is performed (e.g. including a full table walk) and the SLB 28 and TLB 26 are updated prior to storing the data.
For processors employing speculative execution, STORE instructions are normally executed in two passes. A first pass (PASS1) involves the arbitration stage (ARB), as described above, and aspects of the access stage (ACC1). More particularly, ACC1 includes address translation as described above, as well as cache interrogation so as to "preapprove" the STORE. A second pass (PASS2) involves the arbitration stage (ARB) and the access stage (ACC2), resulting in an unconditional write to the translated target address in cache and/or memory.
In a processor employing speculative execution, an indeterminate number of clock cycles may elapse between the end of PASS1 and the beginning of PASS2 in executing a STORE instruction, thereby providing an opportunity for intervening operations (e.g. a LOAD instruction or another PASS1 STORE instruction) to alter the contents of the TLB/SLB 26/28. Thus, even if there was a TLB/SLB 26/28 hit in PASS1 (i.e. the address was found in the TLB/SLB), the translated address needed in PASS2 may have been overwritten by intervening out-of-order operations prior to the beginning of PASS2.
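The hazard described above can be reproduced with a small software model. The sketch below is illustrative only: it assumes a two-entry lookaside buffer with least-recently-used replacement, models intervening LOAD translations as buffer fills, and uses the hypothetical names `TinyTLB`, `store_pass1`, `store_pass2` and `demo`.

```python
from collections import OrderedDict


class TinyTLB:
    """Hypothetical lookaside buffer with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page -> real page, oldest first

    def lookup(self, vpage):
        """Return the real page on a hit, or None on a miss."""
        if vpage in self.entries:
            self.entries.move_to_end(vpage)
            return self.entries[vpage]
        return None

    def fill(self, vpage, rpage):
        """Install a translation, evicting the oldest entry if the buffer is full."""
        self.entries[vpage] = rpage
        self.entries.move_to_end(vpage)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def store_pass1(tlb, vpage) -> bool:
    """PASS1: translate the target address to preapprove the STORE."""
    return tlb.lookup(vpage) is not None


def store_pass2(tlb, vpage):
    """PASS2: the translation is needed again for the unconditional write."""
    return tlb.lookup(vpage)


def demo():
    tlb = TinyTLB(capacity=2)
    tlb.fill(0x10, 0xA0)
    approved = store_pass1(tlb, 0x10)  # hit: STORE is preapproved
    tlb.fill(0x20, 0xB0)               # intervening LOAD translation
    tlb.fill(0x30, 0xC0)               # second intervening LOAD evicts page 0x10
    return approved, store_pass2(tlb, 0x10)
```

In this model `demo()` returns `(True, None)`: the STORE was preapproved in PASS1, yet by PASS2 its translation is no longer present in the buffer.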
When the translated address in the TLB/SLB 26/28 is overwritten by intervening operations, recovery may require a number of complex operations. First, a full table-walk translation must be performed to derive the appropriate address, and the STORE instruction must be retried using the translated address. After retrying the STORE operation, the data may need to be sent to memory, as well as the cache, to ensure coherence between cache and memory.
In addition, instruction fetching may need to be suspended because the information needed to properly manage STOREs to the instruction stream (i.e. STORE instructions which alter instructions later in the program sequence, in other words, "self-modifying code") is unavailable. More particularly, after completing a STORE instruction, the cache is normally checked for the physical address to which a result is being stored (e.g. snooping) in order to determine whether the STORE instruction has changed an instruction which follows it in program order. If the physical address is found, the cache halts instruction fetching and either stores the new data into the appropriate line or invalidates the appropriate line, depending on the desired cache protocol (e.g. write-through, write-back); if the address is not found, instruction fetching continues and the STORE is completed. In the case of a STORE instruction which misses the TLB/SLB in PASS2, the snooping operation cannot occur because the physical address needed is not available; in addition, the virtual address may no longer be available. Therefore, instruction fetching must be suspended until after the full translation occurs.
In addition, until the full translation occurs, external snooping or, in other words, bus snooping (e.g. by devices peripheral to the processor) may need to be suspended since a physical address may be required for external snooping.
Each of these operations requires complicated logic to implement, which adds delay to critical paths, and takes up valuable silicon real estate, all for an event that occurs relatively infrequently.
An alternative to the complex controls described above is to maintain in a buffer the real address for all STORE instructions in progress until after completion. In a super-scalar superpipelined processor, such a buffer would require a large number of registers, each having enough bits to hold a real address, thus requiring additional space as well as complex logic to manage the buffer.
Another alternative would be to structure STORE instructions so that PASS2 occupies two clock cycles. More particularly, such a scheme might in a first cycle translate an address that missed in PASS1 (e.g. perform a full page table walk) and in a second cycle perform an unconditional write to cache or memory. Such a scheme, however, significantly hampers the performance of the processor.