1. Technical Field
The invention relates generally to computer systems, and more particularly relates to computer processors with prefetch and branch units that prefetch instructions, including prefetching predicted branch target addresses supplied by the branch unit.
In an exemplary embodiment, the invention is used in an x86 processor to improve performance of prefetching and branch processing.
2. Related Art
Processors commonly use pipeline techniques to reduce the average execution time per instruction. An execution pipeline is divided into pipe stages--instructions are executed in stages allowing multiple instructions to be in the execution pipeline at the same time. For example, current x86 processor architectures generally use the following pipe stages:
______________________________________
IF    Instruction Fetch (or Prefetch)
ID    Instruction Decode, including instruction length decode
AC    Address Calculation or Operand Access, including register file access, and for memory references, address calculation for operand load (either from cache or external DRAM)
EX    Execute, including arithmetic, logical, and shift operations
WB    Writeback of execution results, either writeback to the register file or store to memory (cache or DRAM)
______________________________________
In particular, to keep the pipeline full, a prefetcher fetches instruction bytes into a prefetch buffer--instruction bytes are transferred to a decoder for decoding into instructions for execution in later stages of the pipeline. As the prefetch buffer is emptied by the decoder, the prefetcher fetches additional instruction bytes either (a) by incrementing the prefetcher IP (instruction pointer), or (b) by switching the code stream in response to a change of flow instruction (such as a branch).
Change of flow (COF) instructions interrupt the code stream, significantly impacting pipeline performance--COFs typically account for 15-30% of the instruction mix. For example, in the x86 instruction set architecture, COFs occur on the average every four to six instructions. COF instructions include branches (including LOOP instructions), jumps, and call/returns--branches are conditional in that the branch may be taken or not taken (depending, for example, on the status of condition codes), while jumps and call/returns are unconditional (always taken). Taken branches and unconditional COFs (UCOFs) interrupt the code stream to cause instruction fetch to proceed from a target address.
Without limiting the scope of the invention, this background information is provided in the context of a general problem to which the invention has application: in a pipelined processor that executes the x86 instruction set, improving performance and efficiency of prefetching and branch processing, and thereby the overall performance of the execution pipeline.
The x86 instruction set architecture (ISA) allows variable length instructions. For the 32-bit and 16-bit x86 architectures (i.e., currently the 486, 586, and 686 generations), instructions can be from 1 to 15 bytes in length (the average instruction is about 2.5 bytes). As a result, instructions will be misaligned in memory--typically, instruction length is decoded during the instruction decode stage of the execution pipeline.
The goal of instruction prefetch is to provide a continuous code stream in the form of instruction bytes to the decoder (thereby maintaining a continuous flow of instructions for execution). Some 486 generation microprocessors used a two-block prefetch buffer operated as a circular queue--a current block was used to buffer instruction bytes being delivered to the decoder, while the other block was used in prefetching the next block of instruction bytes. Prefetch performance is significantly impacted by COF instructions.
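By way of illustration only, the two-block circular-queue arrangement described above might be modeled as follows. The 16 byte block size, the class name, and the `fetch_block` callback are assumptions made for this sketch and are not taken from any particular 486 design.

```python
BLOCK_SIZE = 16  # assumed block size in bytes


class TwoBlockPrefetchBuffer:
    """Two blocks used as a circular queue: one drains to the decoder
    while the other already holds the next sequential block."""

    def __init__(self, fetch_block, start_addr):
        self.fetch_block = fetch_block  # hypothetical callback: address -> BLOCK_SIZE bytes
        self.next_addr = start_addr     # next block-aligned address to prefetch
        self.blocks = [b"", b""]
        self.current = 0                # index of the block feeding the decoder
        self.offset = 0                 # read position within the current block
        self._fill(0)
        self._fill(1)

    def _fill(self, index):
        self.blocks[index] = self.fetch_block(self.next_addr)
        self.next_addr += BLOCK_SIZE

    def read(self, count):
        """Deliver `count` instruction bytes to the decoder."""
        out = bytearray()
        while count > 0:
            block = self.blocks[self.current]
            take = min(count, BLOCK_SIZE - self.offset)
            out += block[self.offset:self.offset + take]
            self.offset += take
            count -= take
            if self.offset == BLOCK_SIZE:
                # Current block is drained: refill it with the next
                # sequential block and switch to the other block.
                self._fill(self.current)
                self.current ^= 1
                self.offset = 0
        return bytes(out)
```

The sketch covers sequential prefetching only; a change of flow would require discarding both blocks and refilling from the target address.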
The 486 generation microprocessors do not have a branch unit to provide dynamic prediction of branch direction--rather, branches are statically predicted not-taken and LOOPs are statically predicted taken. For branches, prefetching continues along the not-taken (fall through) path, and the execution pipe is flushed if the branch resolves taken in EX. For LOOPs, the prefetcher stalls until the target is fetched during AC/EX.
To improve pipeline performance on COFs, 586 and 686 generation microprocessors have included branch processing units to predict the direction of branches, and in the case of predicted taken branches (and UCOFs), to switch the prefetcher to the target address immediately. Branch processing significantly reduces the instances in which the prefetcher and decoder are stalled due to a COF, which is particularly important from a pipeline performance standpoint as execution pipelines are lengthened (for example, by superpipelining a stage, such as address calculation, into two stages).
A branch unit includes a branch target cache (BTC) as well as branch prediction and branch resolution logic. When a branch is initially decoded and executed, then typically (based on the prediction algorithm), if the branch is taken, its target address is stored in the BTC as a predicted-taken branch (not-taken branches are typically not stored in the BTC)--the next time the branch is detected (during prefetch or decode), the BTC will supply the target address to the prefetcher. For each branch entry, the BTC typically stores (a) a tag identifying the branch instruction, (b) the associated predicted target address, and (c) one or more history bits used by the branch prediction logic--a conventional approach is to use as the BTC tag the address of the instruction prior to the COF to permit prefetching to switch to a predicted taken direction as this prior instruction and the COF instruction are decoding.
In particular, using the address of the instruction prior to the branch as the tag enables the BTC to be accessed, and a predicted-taken target address supplied to the prefetcher, in the clock prior to decoding the branch instruction. In response to a hit in the BTC, the prefetcher switches the code stream in the next clock to the target direction, making the target instruction bytes available to the decoder immediately after decoding the branch instruction (assuming the prefetch target address hits in the cache) without stalling the execution pipeline.
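By way of illustration only, the tagging scheme described above might be sketched as follows. The class and function names are hypothetical, and a Python dictionary stands in for what would in practice be a finite, set-associative cache with history bits.

```python
class BranchTargetCache:
    """Toy BTC: maps the address of the instruction *preceding* a COF
    to the predicted target address of that COF."""

    def __init__(self):
        self.entries = {}  # tag (prior instruction address) -> target address

    def allocate(self, prior_instr_addr, target_addr):
        """Record a taken COF, tagged by the instruction that precedes it."""
        self.entries[prior_instr_addr] = target_addr

    def lookup(self, instr_addr):
        """Return the predicted target on a hit, or None on a miss."""
        return self.entries.get(instr_addr)


def next_prefetch_address(btc, decoding_addr, sequential_addr):
    """Choose the prefetch address for the next clock.

    A hit on the instruction currently decoding means the *next*
    instruction is a predicted-taken COF, so the prefetcher can switch
    to the target one clock before the COF itself is decoded."""
    target = btc.lookup(decoding_addr)
    return target if target is not None else sequential_addr
```

For example, after `btc.allocate(0x1000, 0x2000)`, decoding the instruction at `0x1000` redirects prefetching to `0x2000`, while any other address falls through to the sequential prefetch address.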
The branch prediction logic implements a prediction algorithm based on the history bits stored with the corresponding branch entry in the BTC. The actual branch direction (taken or not-taken) resolves in EX in response to condition code update--if the branch is mispredicted, branch resolution logic repairs the execution pipeline. Repair of mispredicted branches involves terminating execution of the instructions in the mispredicted direction, restoring the state of the machine, and restarting execution from the correct instruction (including prefetching in the not-predicted direction)--a branch misprediction results in a branch misprediction penalty corresponding to the number of clocks lost by mispredicting the branch.
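The prediction algorithm is not specified here. As one commonly used possibility, and purely as an assumption for illustration, two history bits may be operated as a saturating counter, as in the following sketch (all names are hypothetical).

```python
# Assumed encoding of two history bits as a saturating counter.
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = range(4)


def predict_taken(history):
    """Predict taken when the counter is in either of the two taken states."""
    return history >= WEAK_TAKEN


def update_history(history, taken):
    """Move one step toward the resolved direction, saturating at the ends."""
    if taken:
        return min(history + 1, STRONG_TAKEN)
    return max(history - 1, STRONG_NOT_TAKEN)


def count_mispredictions(outcomes, history=WEAK_TAKEN):
    """Replay resolved branch directions and count mispredictions."""
    mispredicts = 0
    for taken in outcomes:
        if predict_taken(history) != taken:
            mispredicts += 1  # each of these would incur the misprediction penalty
        history = update_history(history, taken)
    return mispredicts
```

Under this assumed scheme, a loop-closing branch that is taken nine times and then falls through is mispredicted only once, on the final iteration.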
Branch units typically store target addresses for all changes of flow--branches as well as unconditional COFs (UCOFs) such as jumps and call/returns. In the case of UCOFs, no prediction is required, but the stored target address can be used to immediately switch prefetching to the target address (i.e., without waiting for the UCOF to be decoded).
The x86 ISA supports both segmentation and paging, and allows self-modifying code. In 586 and 686 generation processors, using a branch unit to supply target addresses to the prefetcher, and increasing the depth of the execution pipeline, necessitates taking into account segment limit checking and detecting self-modifying code.
Regarding segment limit checking, according to the 32-bit x86 memory management model (protected mode), addresses are generated using segmentation and, if enabled, paging. A code segment is defined by a segment base and segment limit both of which may be arbitrarily set in physical memory--a page is 4 Kbytes of physical memory. A segmented linear address (LA) is calculated by adding the segment base address to an offset (effective) address formed by adding two or three address components (relative base, displacement, and index)--this address is also the physical address (PA) if paging is not enabled. If paging is enabled, the physical address is obtained by translating the high order 20 bits [31:12] of the linear address to obtain a page base address--the low order bits [11:0] provide a 4 Kbyte offset address within the page. Thus, the low order bits of the linear address and the translated physical address are the same.
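By way of illustration only, the address generation steps described above might be sketched as follows. The function names are hypothetical, the page table is reduced to a simple dictionary, and index scaling is omitted for brevity.

```python
PAGE_SHIFT = 12                           # 4 Kbyte pages
PAGE_OFFSET_MASK = (1 << PAGE_SHIFT) - 1  # bits [11:0]
ADDR_MASK = 0xFFFFFFFF                    # 32-bit addresses


def effective_address(base=0, index=0, displacement=0):
    """Offset (effective) address formed from its address components."""
    return (base + index + displacement) & ADDR_MASK


def linear_address(segment_base, offset):
    """Segmented linear address: segment base plus offset."""
    return (segment_base + offset) & ADDR_MASK


def physical_address(linear, page_table=None):
    """Translate bits [31:12]; bits [11:0] pass through unchanged.

    `page_table` is a hypothetical dict mapping a linear page number to a
    physical page base address. None models paging being disabled, in
    which case the linear address is also the physical address."""
    if page_table is None:
        return linear
    page_base = page_table[linear >> PAGE_SHIFT]
    return page_base | (linear & PAGE_OFFSET_MASK)
```

For instance, with a segment base of `0x00401000` and an offset of `0x124`, the linear address is `0x00401124`; if linear page `0x00401` maps to physical page base `0x0009A000`, the physical address is `0x0009A124`, sharing its low order 12 bits with the linear address.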
Each linear address calculation requires a segment limit check to determine if a linear address crosses the segment boundary. Separate code and data segments are defined--if the prefetcher crosses a code segment boundary, a segment limit violation exception is signaled.
The prefetcher typically maintains the linear and physical address for the current prefetch address (memory aligned), as well as the associated code segment limit. For sequential prefetching, the prefetcher increments the physical address to generate the prefetch address to the cache, and increments the corresponding linear address to detect if the prefetch address crosses the segment boundary (instruction bytes beyond the segment limit are invalidated).
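By way of illustration only, sequential prefetching with a segment limit check might be sketched as follows. The sketch assumes 16 byte prefetch blocks, expresses the segment limit as the linear address of the last valid byte, and assumes all blocks lie within one page so that the physical address can simply be incremented; the function names are hypothetical.

```python
BLOCK_SIZE = 16  # assumed prefetch block size in bytes


def valid_bytes_in_block(block_linear_addr, limit_linear_addr):
    """Number of bytes in an aligned block that lie at or below the limit."""
    if block_linear_addr > limit_linear_addr:
        return 0
    return min(BLOCK_SIZE, limit_linear_addr - block_linear_addr + 1)


def sequential_prefetch(linear, physical, limit_linear_addr, blocks):
    """Yield (physical prefetch address, valid byte count) per block.

    The physical address goes to the cache; the linear address is
    incremented in step and used only for the segment limit comparison."""
    for _ in range(blocks):
        yield physical, valid_bytes_in_block(linear, limit_linear_addr)
        linear += BLOCK_SIZE
        physical += BLOCK_SIZE
```

A block that straddles the limit is returned with a reduced valid byte count, which corresponds to invalidating the instruction bytes beyond the segment limit.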
The branch unit typically supplies physical target addresses to the prefetcher--when an entry in the BTC is allocated for a branch instruction, the associated target address is the physical address obtained from the AC stage after linear address calculation and page translation. Supplying a physical target address allows the prefetcher to immediately begin prefetching (accessing the cache) without the necessity of translating a linear address.
The target address supplied by the BTC is the address of the target instruction, which need not be memory aligned--the prefetcher or the cache logic will convert this target address into a memory aligned prefetch address by ignoring the low order bits (for example, bits [3:0] for 16 byte cache lines). Thus, the branch unit may supply a target address that would cause the prefetcher to jump into a prefetch block (i.e., cache line) containing a segment limit--while the prefetcher will have the physical prefetch address, it will not have the corresponding linear address to compare with the code segment limit (i.e., the target linear address is not generated until the COF instruction reaches the AC stage). As a result, the prefetcher may prefetch beyond the segment limit, which is contrary to the 486 specification.
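By way of illustration only, converting a target address into a memory aligned prefetch address for 16 byte cache lines might be sketched as follows (the function names are hypothetical).

```python
LINE_SIZE = 16               # 16 byte cache lines
OFFSET_MASK = LINE_SIZE - 1  # low order four bits of the address


def aligned_prefetch_address(target_addr):
    """Ignore the low order bits to obtain the start of the cache line."""
    return target_addr & ~OFFSET_MASK


def line_offset(target_addr):
    """Position of the target instruction within its cache line."""
    return target_addr & OFFSET_MASK
```

For a target address of `0x9A12B`, the prefetch address is `0x9A120` and the target instruction begins at offset `0xB`; the whole line through `0x9A12F` is fetched, regardless of where a segment limit may fall within it.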
Regarding self-modifying code, the standard 486 specification requires that a write instruction that modifies a "target" instruction be followed immediately by a jump to the modified target instruction--as a result, the target instruction is first modified by the write, and then fetched by the jump for execution. Not all 486 code follows this specification.
For 586 and 686 generation architectures, maintaining compatibility with existing software that includes self-modifying code is made problematic by architectural changes that increase the likelihood that a write to an instruction will not complete before the instruction is fetched. Such architectural features include dynamic branch prediction, increased prefetch buffer size, and store reservation stations (pre-cache write buffers).
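By way of illustration only, the hazard described above might be modeled with a toy machine in which stores are buffered ahead of memory while the prefetcher reads memory directly. The class and its methods are hypothetical and greatly simplified relative to any actual store reservation station.

```python
class ToyMachine:
    """Memory plus a pre-cache write buffer that the prefetcher does not see."""

    def __init__(self, code):
        self.memory = bytearray(code)
        self.pending_stores = []  # buffered (address, value) writes

    def store(self, addr, value):
        """Queue a write; it is not yet visible to instruction fetch."""
        self.pending_stores.append((addr, value))

    def drain_stores(self):
        """Complete all buffered writes in program order."""
        for addr, value in self.pending_stores:
            self.memory[addr] = value
        self.pending_stores.clear()

    def prefetch(self, addr, count):
        """Fetch instruction bytes, ignoring any pending stores."""
        return bytes(self.memory[addr:addr + count])
```

In this model, a prefetch issued between `store` and `drain_stores` returns the unmodified instruction bytes, which is the stale-fetch condition that self-modifying code must not be allowed to observe.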