1. Field of the Invention
The present invention relates to pipelined superscalar processors. Specifically, the present invention deals with retrieving results from a completion unit register file prior to writing them into their destination registers.
2. Discussion of the Prior Art
A typical early microprocessor included a central processing unit (CPU) which implemented a machine capable of interfacing with memory and serially executing a sequence of instructions. The instruction execution was typically broken into at least four major stages: instruction and operand fetch, instruction decode, execution, and write back of results into the destination registers. A typical instruction took one clock cycle to execute, so that each of the four functions was performed in that cycle. Each stage had to wait for the results from the previous stage before its work could be accomplished. The minimum clock period was then determined by the longest possible propagation delay through all four stages.
The concept of pipelining increased the maximum clock frequency by reducing the amount of logic performed in each clock cycle. To facilitate this, for example, the interface between the second and third stages could be separated by clocked latches. The first two stages (fetch and decode) would be performed in one clock cycle. Subsequently, during a second clock cycle, the last two stages (execution and write back) would be performed. Here, the overall latency of an instruction might remain approximately the same since the total amount of time from the beginning of the fetch to the end of the write back would be approximately the same. However, separating the instruction execution into two distinct pieces has the important advantage that the throughput could be increased by as much as a factor of two. This is a result of the fact that the pipelined CPU can operate on two instructions simultaneously. While the execution and write back of one instruction is being performed, the fetch and decoding of a second instruction can be performed. Quite naturally, this pipelining concept can be extended such that each of the four stages is performed in a separate clock cycle, thus increasing the throughput accordingly. Thus, by dividing the logic into N separate segments, the throughput can theoretically be increased by a factor of N.
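The throughput argument above can be checked with a small sketch. It assumes an idealized pipeline that the text does not specify in detail: the logic divides evenly into N stages, the latches add no delay, and the pipeline never stalls. The function names are illustrative only.

```python
def total_cycles(num_instructions: int, num_stages: int) -> int:
    """Short clock cycles needed to finish a stream on an ideal N-stage pipeline."""
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles to drain through the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def ideal_speedup(num_instructions: int, num_stages: int) -> float:
    """Speedup over an unpipelined machine whose single cycle spans all stages."""
    if num_instructions <= 0:
        raise ValueError("need at least one instruction")
    # Assumption: one unpipelined cycle lasts as long as num_stages short cycles.
    unpipelined = num_instructions * num_stages
    return unpipelined / total_cycles(num_instructions, num_stages)
```

Under these assumptions a single instruction gains nothing, since its latency is unchanged, while a long instruction stream approaches but never reaches a speedup of N.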
Superpipelined designs break up the logic in some or all of the four stages so as to reduce the maximum propagation delay through any one stage, and thus increase the operating frequency and throughput as the instruction execution is broken into more than four pipeline stages. A superscalar microprocessor 10 is shown in FIG. 1.
A superscalar processor 10 has several parallel functional units 11. Some typical superscalar implementations include floating point, integer, branch, and load/store functional units 11. Typically, among the functional units 11 are units which can perform floating point calculations or similar complex operations. It is desirable to run these complex units at the same clock frequency as the rest of the hardware, while still allowing each functional unit 11 to begin executing a new instruction every cycle. To accomplish these objectives, pipelining of the parallel functional units 11 is desirable. The complexity and logical partitioning of the most complicated functional unit dictate the number of pipeline stages necessary in that functional unit 11. Not all of the functional units 11 have the same latency. Of these four parallel functional units 11, the floating point unit is likely to be the most complex. Because the other three types of functional units 11 are not as complex, it is possible to pipeline these other functional units into fewer stages than are required for the pipelining of the floating point unit. All of the parallel functional units 11 merge back into the final stage 12 in which the results of the executions are written into their respective destination registers.
If the pipeline for every parallel functional unit 11 is not the same number of stages, then the results from some functional units 11 will be available sooner than others. For example, if the pipelining of the floating point unit requires five stages, while the pipelining for the integer unit only takes two stages, then the results from the integer unit would be available three clock cycles prior to the results of the floating point unit even though both instructions were dispatched concurrently. By the same token, "younger" instructions can finish sooner than older instructions. For example, during one clock cycle a floating point instruction is dispatched, and during the next subsequent clock cycle, an integer instruction such as an addition is dispatched. If the integer pipeline is three stages shorter than the floating point pipeline, the integer addition result will be available two clock cycles before the floating point result even though the floating point instruction was dispatched first. In this example, the integer addition was a "younger" instruction because it was dispatched later than the "older" floating point instruction.
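The two timing examples above can be reproduced with a minimal sketch. The stage counts come from the text; the function name and the convention that a result is ready at dispatch cycle plus stage count are assumptions made for illustration.

```python
FP_STAGES = 5   # floating point pipeline depth used in the example above
INT_STAGES = 2  # integer pipeline depth used in the example above


def result_ready_cycle(dispatch_cycle: int, stages: int) -> int:
    """Cycle at which a result leaves a pipeline of the given depth."""
    return dispatch_cycle + stages


# Both instructions dispatched in cycle 0: the integer result leads by 3 cycles.
concurrent_lead = result_ready_cycle(0, FP_STAGES) - result_ready_cycle(0, INT_STAGES)

# Integer instruction dispatched one cycle later: it still leads by 2 cycles,
# so the "younger" instruction finishes before the "older" one.
younger_lead = result_ready_cycle(0, FP_STAGES) - result_ready_cycle(1, INT_STAGES)
```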
If some younger instructions are allowed to write into the destination registers before some older instructions, potential problems arise. For example, if both instructions write to the same destination register, the programmer expects the younger instruction to write over the results of the older instruction. If in fact the older instruction writes over the results of the younger instruction, the processor has not correctly executed the sequential program since the intended result does not appear in the intended destination. Therefore it is important to maintain the sequential nature of instruction result writeback.
Referring now to FIG. 2, in order to facilitate sequential write-back of instruction results, any shorter functional unit pipeline 20 is lengthened with extra pipeline stages so that all functional units have the same latency from dispatch 21 to write back 22. In the shorter functional unit pipeline 20, this results in several extra stages 23 being added to the end of the functional unit pipeline 20. These additional pipeline stages 23 effectively form a first-in-first-out (FIFO) buffer, known in the industry as a completion unit register file, which will be referred to as an annex.
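As a rough sketch of the padding described above, the number of annex stages for a unit can be taken as the difference between the longest pipeline depth and that unit's own depth. The function names and the cycle-counting convention are assumptions for illustration, not taken from the figures.

```python
def annex_stages(unit_stages: int, longest_stages: int) -> int:
    """Extra FIFO stages appended so that every unit has the same latency."""
    if unit_stages > longest_stages:
        raise ValueError("unit cannot be deeper than the longest pipeline")
    return longest_stages - unit_stages


def write_back_cycle(dispatch_cycle: int, unit_stages: int, longest_stages: int) -> int:
    """Cycle in which a result reaches write-back after passing through the annex."""
    return dispatch_cycle + unit_stages + annex_stages(unit_stages, longest_stages)
```

With a five-stage floating point unit and a two-stage integer unit, the integer pipeline receives three annex stages, and an integer instruction dispatched one cycle after a floating point instruction now writes back one cycle after it, so dispatch order and write-back order agree.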
Although it preserves the sequential nature of program execution and write-back, the addition of the extra stages 23 to the shorter functional unit pipeline 20 creates additional complications. Assume the shorter functional unit 20 in FIG. 2 is an integer ALU functional unit which might be used to execute a sequence of instructions. For example, consider the following program fragment, where add rs1,rs2,rd denotes that the contents of register rs1 and the contents of register rs2 should be added and the result should be stored to register rd.
add r1,r2,r3;
add r3,r4,r5;
Here, the first instruction creates a result for register r3. The very next instruction uses the value for r3 that was computed by the previous instruction. Unfortunately, however, the value of r3 which was calculated by the first instruction has not been written into the register file by the time the second instruction begins execution. The destination register in the register file is not finally updated until the write-back stage 22 is reached at the end of the annex 23. Therefore, the correct operand of the second instruction is resident in the first entry of the annex when the second instruction is executing. In order to allow the entries of the annex to be utilized by subsequent instructions, some access to the annex 23 must be provided. In addition, some means must be provided to determine which register addresses are stored in each entry of the annex 23 in order to determine which entry, if any, should be provided back to the inputs of the functional units 20 and 24.
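The hazard in the two-instruction fragment can be illustrated with a small software model. This is only a sketch under stated assumptions: the annex is modeled as a Python list ordered oldest first, register names are strings, and the helper names read_operand and execute_add are hypothetical. Write-back from the annex to the register file is deliberately not modeled.

```python
def read_operand(reg, register_file, annex):
    """Return the youngest value for reg, preferring the annex over the register file."""
    # Scan youngest-first so the most recent pending result wins.
    for dest, value in reversed(annex):
        if dest == reg:
            return value
    # No pending result in the annex: the register file copy is current.
    return register_file[reg]


def execute_add(rs1, rs2, rd, register_file, annex):
    """Model 'add rs1,rs2,rd': the result enters the annex, not the register file."""
    result = read_operand(rs1, register_file, annex) + read_operand(rs2, register_file, annex)
    annex.append((rd, result))
    return result
```

In this model, after add r1,r2,r3 the register file still holds the stale r3, and the second add obtains the correct operand only because the annex is consulted first.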
Referring to FIG. 3, an advantageous method involves the use of a memory 30 to implement the annex. If an entry in the annex 30 contains the most up-to-date or "youngest" version of a variable which needs to be used as an operand in a current instruction, that entry can be driven onto the annex access bus 31 and selected by the multiplexor 32 to provide the input to the functional unit 33. Instead of feeding through each entry of the annex 30, only one set of feedthroughs 31 is provided to a two-input multiplexor 32. This eliminates any increase in the pitch of the functional unit datapath 34 regardless of the number of entries in the annex 30. The number of inputs to the multiplexor 32 remains two regardless of the number of entries in the annex 30.
One way to determine which entry, if any, should be provided to the multiplexor 32 involves the use of content addressable memory in the address field of the annex 30. Each entry in the annex 30 stores an address field and a data field. The address field contains the address of the destination register, while the data field holds the results to be stored to that destination register. When attempting to determine if one of the operands is in the annex 30, the address of that operand is simultaneously compared to all of the addresses stored within the annex. If there is a match, the match line for that entry is asserted.
Most instructions require more than one operand. For instance, an add x,y,z instruction adds the contents of x to the contents of y and stores the result in z; thus x and y are operand registers. Because it is possible that data destined for both x and y are in the annex 30, more than one compare port and set of match lines in the content addressable memory should be provided to allow simultaneous look-up for all the operands.
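The content addressable look-up with one compare port per operand can be modeled in software as follows. This is a behavioral sketch only: hardware compares all entries in parallel, whereas the list comprehension here merely produces the same match-line pattern. The AnnexEntry type, its valid flag, and the function names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnnexEntry:
    address: int        # destination register address (address field)
    data: int           # result waiting to be written back (data field)
    valid: bool = True  # assumed flag marking an occupied entry


def match_lines(annex, operand_address):
    """One compare port: a match line per entry, asserted on an address match."""
    return [entry.valid and entry.address == operand_address for entry in annex]


def lookup_operands(annex, operand_addresses):
    """One compare port and one set of match lines per operand."""
    return [match_lines(annex, address) for address in operand_addresses]
```

Note that several match lines may be asserted for the same operand when more than one pending result targets the same register; selecting the youngest such entry is left outside this sketch.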
The search for an annex entry which contains results for a particular operand destination register address is complicated by the fact that some processors support "virtual" register addresses which are different from the "physical" register addresses. In most implementations, the number of real physical addresses differs from the number of virtual addresses. Usually, there are more virtual addresses than physical addresses. When there are more valid virtual addresses than physical addresses, some of the virtual addresses must map to the same physical address. Multiple virtual addresses which map to the same physical address are called "aliases".
An architecture using register windows commonly uses virtual addresses which map to physical addresses. The operand registers and destination register addresses are referred to by their respective virtual addresses within instructions. There are many advantages to the use of register windows. Register windows allow for multiple register use without the time-consuming spills and fills to main memory characteristic of fixed register set designs found in many architectures. Using a large "windowed" register file, the processor has access to a large number of on-chip registers configured in overlapping sets. This access permits compilers to automatically cache values and pass parameters in registers. At any one instant, a program sees eight GLOBAL integer registers plus a 24-register window into a larger register file. The windowed registers can be used as a cache of procedure arguments, local values, and return addresses.
FIG. 4 illustrates the circular stack nature of register windows for an eight window implementation. FIG. 5 illustrates the same window structure from a flat perspective. At any time, a program can address 32 integer registers: 8 INs 50, 8 LOCALs 51, and 8 OUTs 52 of the active window, and the 8 GLOBAL registers 53 that are addressable from any window. The 8 OUTs 52 of one window are also the 8 INs 50 of an adjacent window. The overlapping nature of the register windows makes them ideal for multitasking environments like UNIX, where parameters can be quickly passed between procedures. Because of the overlapping nature of the register windows where specific registers are shared between the windows, there is no actual movement of parameters. Rather, parameters are simply shared between the caller and callee register windows.
The actual number of physical registers in the register file is less than the number of virtual registers in the virtual register files. In part, this is because the eight physical registers designated as GLOBAL registers 53 are shared among all virtual register windows. Moreover, the IN registers 50 of one virtual window and the OUT registers 52 of the adjacent virtual window are, in fact, the same physical register. Thus, for a virtual register file having K windows, each having N GLOBAL, N LOCAL, N IN, and N OUT registers, the total number of virtual registers is 4*K*N, while the total number of real physical registers is only (2K+1)*N.
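The two register counts can be evaluated directly for the eight-window, eight-register case described above. The function names are illustrative; the formulas are those given in the text.

```python
def virtual_register_count(num_windows: int, group_size: int) -> int:
    """4*K*N: every window names N GLOBAL, N LOCAL, N IN, and N OUT registers."""
    return 4 * num_windows * group_size


def physical_register_count(num_windows: int, group_size: int) -> int:
    """(2K+1)*N: K LOCAL groups, K shared IN/OUT groups, and one GLOBAL group."""
    return (2 * num_windows + 1) * group_size
```

For K = 8 and N = 8 this gives 256 virtual registers backed by 136 physical registers.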
FIG. 6 shows a typical mapping of the virtual register windows to physical registers 60. Within each window, the registers are virtually addressed in the same way. Registers r[0] through r[7] are the GLOBAL registers g0-g7; registers r[8] through r[15] are the OUT registers o0-o7; registers r[16] through r[23] are the LOCAL registers l0-l7; and registers r[24] through r[31] are the IN registers i0-i7. A current window pointer (CWP) 61 keeps track of the current window.
Because of the unique virtual register file configuration, a virtual to physical register address 62 translation must take place each time a register is accessed. The translation is accomplished by applying the current window pointer (CWP) 61 and the virtual register number to a translator, which performs a calculation to obtain the physical register address 62. This is a moderately complicated calculation. In the mapping shown in FIG. 6, the physical address 62 could be translated by adding the virtual register index to the product of 16 and the current window pointer 61. If the binary representation of the current window pointer 61 is W2 W1 W0, and the binary representation of the virtual register index is R4 R3 R2 R1 R0, the translation for non-GLOBAL registers would be as follows.
$$\text{physical address} = 16 \times (W_2 W_1 W_0)_2 + (R_4 R_3 R_2 R_1 R_0)_2$$
In the example mapping shown in FIG. 6, the GLOBAL registers are those registers whose virtual register index is between 0 and 7 inclusive. Thus, if R4 and R3 are zero, the physical address 62 is simply the register index.
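A minimal sketch of the translation follows, assuming the eight-window mapping described for FIG. 6. The wraparound step is an assumption that the text does not state: it is included so that the IN registers of the last window alias the OUT registers of window 0, keeping the total at 136 physical registers. The constant and function names are hypothetical.

```python
NUM_WINDOWS = 8     # eight-window implementation, as in FIG. 4
WINDOW_STRIDE = 16  # multiplier applied to the current window pointer
NUM_GLOBALS = 8     # virtual indices 0-7, where R4 = R3 = 0


def translate(cwp: int, virtual_index: int) -> int:
    """Map (current window pointer, virtual register index) to a physical address."""
    if not 0 <= cwp < NUM_WINDOWS:
        raise ValueError("current window pointer out of range")
    if not 0 <= virtual_index < 32:
        raise ValueError("virtual register index out of range")
    if virtual_index < NUM_GLOBALS:
        # GLOBAL registers: the physical address is simply the register index.
        return virtual_index
    physical = WINDOW_STRIDE * cwp + virtual_index
    # Assumption (not stated in the text): wrap around the circular window stack.
    windowed_span = WINDOW_STRIDE * NUM_WINDOWS
    if physical >= NUM_GLOBALS + windowed_span:
        physical -= windowed_span
    return physical
```

Under this mapping, virtual register 24 (i0) in window 0 and virtual register 8 (o0) in window 1 translate to the same physical address, which is an instance of the aliasing discussed above.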
Since the registers are designed to be the fastest memory that the processor accesses, it is desirable to minimize the number of times a virtual to physical register address 62 calculation must be performed. The designer must decide which type of address to use in the annex structure. Referring to FIG. 7, if the physical address 62 is used, then a virtual-to-physical address translation 70 must occur prior to the results of an instruction being entered into the annex 71. Furthermore, in order to search the annex 71 for operands of an instruction, each operand's virtual address must be translated prior to the search. This presents a large overhead in both hardware and delay.