This invention relates in general to microprocessors and, more particularly, to high performance superscalar microprocessors.
Like many other modern technical disciplines, microprocessor design is a technology in which engineers and scientists continually strive for increased speed, efficiency and performance. Generally speaking, microprocessors can be divided into two classes, namely scalar and vector processors. The most elementary scalar processor processes a maximum of one instruction per machine cycle. So-called "superscalar" processors can process more than one instruction per machine cycle. In contrast with the scalar processor, a vector processor can process a relatively large array of values during each machine cycle.
Vector processors rely on data parallelism to achieve processing efficiencies whereas superscalar processors rely on instruction parallelism to achieve increased operational efficiency. Instruction parallelism may be thought of as the inherent property of a sequence of instructions which enables such instructions to be processed in parallel. In contrast, data parallelism may be viewed as the inherent property of a stream of data which enables the elements thereof to be processed in parallel. Instruction parallelism is related to the number of dependencies which a particular sequence of instructions exhibits. Dependency is defined as the extent to which a particular instruction depends on the result of another instruction. In a scalar processor, when an instruction exhibits a dependency on another instruction, the dependency generally must be resolved before the instruction can be passed to a functional unit for execution. For this reason, conventional scalar processors experience undesirable time delays while the processor waits pending resolution of such dependencies.
Several approaches have been employed over the years to speed up the execution of instructions by processors and microprocessors. One approach which is still widely used in microprocessors today is pipelining. In pipelining, an assembly line approach is taken in which the three microprocessor operations of 1) fetching the instruction, 2) decoding the instruction and gathering the operands, and 3) executing the instruction and writeback of the result, are overlapped to speed up processing. In other words, instruction 1 is fetched and instruction 1 is decoded in respective machine cycles. While instruction 1 is being decoded and its operands are gathered, instruction 2 is fetched. While instruction 1 is being executed and the result written, instruction 2 is being decoded and its operands are gathered, and instruction 3 is being fetched. In actual practice, the assembly line approach may be divided into more assembly line stations than described above. A more in-depth discussion of the pipelining technique is described by D. W. Anderson et al. in their publication "The IBM System/360 Model 91: Machine Philosophy", IBM Journal, Vol. 11, January 1967, pp. 8-24.
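By way of illustration only, the overlap described above can be sketched in Python. The sketch assumes an idealized three-stage pipeline with no stalls; the function name pipeline_schedule and the stage labels are hypothetical and are not taken from the cited reference.

```python
# Idealized three-stage pipeline: one new instruction enters per machine cycle.
STAGES = ["fetch", "decode", "execute"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each machine cycle to the (instruction number, stage) pairs active in it."""
    schedule = {}
    for index in range(num_instructions):
        for offset, stage in enumerate(stages):
            # Instruction `index` (0-based) occupies stage `offset` in cycle index + offset + 1.
            cycle = index + offset + 1
            schedule.setdefault(cycle, []).append((index + 1, stage))
    return schedule


schedule = pipeline_schedule(3)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

In this sketch, cycle 3 has instruction 1 executing, instruction 2 being decoded and instruction 3 being fetched, and the three instructions finish in five cycles rather than the nine that strictly sequential processing would require.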
The following definitions are now set forth for the purpose of promoting clarity in this document. "Dispatch" is the act of sending an instruction from the instruction decoder to a functional unit. "Issue" is the act of placing an instruction in execution in a functional unit. "Completion" is achieved when an instruction finishes execution and the result is available. An instruction is said to be "retired" when the instruction's result is written to the register file. This is also referred to as "writeback".
The recent book, Superscalar Microprocessor Design, William Johnson, 1991, Prentice-Hall, Inc., describes several general considerations for the design of practical superscalar microprocessors. FIG. 1 is a block diagram of a microprocessor 10 which depicts the implementation of a superscalar microprocessor described in the Johnson book. Microprocessor 10 includes an integer unit 15 for handling integer operations and a floating point unit 20 for handling floating point operations. Integer unit 15 and floating point unit 20 each include their own respective, separate and dedicated instruction decoder, register file, reorder buffer, and load and store units. More specifically, integer unit 15 includes instruction decoder 25, a register file 30, a reorder buffer 35, and load and store units (60 and 65), while floating point unit 20 includes its own instruction decoder 40, register file 45, reorder buffer 50, and load and store units (75 and 80) as shown in FIG. 1. The reorder buffers contain the speculative state of the microprocessor, whereas the register files contain the architectural state of the microprocessor.
Microprocessor 10 is coupled to a main memory 55 which may be thought of as having two portions, namely an instruction memory 55A for storing instructions and a data memory 55B for storing data. Instruction memory 55A is coupled to both integer unit 15 and floating point unit 20. Similarly, data memory 55B is coupled to both integer unit 15 and floating point unit 20. In more detail, instruction memory 55A is coupled to decoder 25 and decoder 40 via instruction cache 58. Data memory 55B is coupled to load functional unit 60 and store functional unit 65 of integer unit 15 via a data cache 70. Data memory 55B is also coupled to a float load functional unit 75 and a float store functional unit 80 of floating point unit 20 via data cache 70. Load unit 60 performs the conventional microprocessor function of loading selected data from data memory 55B into integer unit 15, whereas store unit 65 performs the conventional microprocessor function of storing data from integer unit 15 in data memory 55B.
A computer program includes a sequence of instructions which are to be executed by microprocessor 10. Computer programs are typically stored in a hard disk, floppy disk or other non-volatile storage media which is located in a computer system. When the program is run, the program is loaded from the storage media into main memory 55. Once the instructions of the program and associated data are in main memory 55, the individual instructions can be prepared for execution and ultimately be executed by microprocessor 10.
After being stored in main memory 55, the instructions are passed through instruction cache 58 and then to instruction decoder 25. Instruction decoder 25 examines each instruction and determines the appropriate action to take. For example, decoder 25 determines whether a particular instruction is a PUSH, POP, LOAD, AND, OR, EX OR, ADD, SUB, NOP, JUMP, JUMP on condition (BRANCH) or other type of instruction. Depending on the particular type of instruction which decoder 25 determines is present, the instruction is dispatched to the appropriate functional unit. In the superscalar architecture proposed in the Johnson book, decoder 25 is a multi-instruction decoder which is capable of decoding 4 instructions per machine cycle. It can thus be said that decoder 25 exhibits a bandwidth which is four instructions wide.
As seen in FIG. 1, an OP CODE bus 85 is coupled between decoder 25 and each of the functional units, namely, branch unit 90, arithmetic logic units 95 and 100, shifter unit 105, load unit 60 and store unit 65. In this manner, the OP CODE for each instruction is provided to the appropriate functional unit.
Departing momentarily from the immediate discussion, it is noted that instructions typically include multiple fields in the following format: OP CODE, OPERAND A, OPERAND B, DESTINATION REGISTER. For example, the sample instruction ADD A, B, C would mean ADD the contents of register A to the contents of register B and place the result in the destination register C. The handling of the OP CODE portion of each instruction has already been discussed above. The handling of the OPERANDS for each instruction will now be described.
Not only must the OP CODE for a particular instruction be provided to the appropriate functional unit, but also the designated OPERANDS for that instruction must be retrieved and sent to the functional unit. If the value of a particular operand has not yet been calculated, then that value must be first calculated and provided to the functional unit before the functional unit can execute the instruction. For example, if a current instruction is dependent on a prior instruction, the result of the prior instruction must be determined before the current instruction can be executed. This situation is referred to as a dependency.
The operands which are needed for a particular instruction to be executed by a functional unit are provided by either register file 30 or reorder buffer 35 to operand bus 110. Operand bus 110 is coupled to each of the functional units. Thus, operand bus 110 conveys the operands to the appropriate functional unit. In actual practice, operand bus 110 includes separate buses for OPERAND A and OPERAND B.
Once a functional unit is provided with the OP CODE and OPERAND A and OPERAND B, the functional unit executes the instruction and places the result on a result bus 115 which is coupled to the output of all of the functional units and to reorder buffer 35 (and to the respective reservation stations at the input of each functional unit as will now be discussed).
The input of each functional unit is provided with a "reservation station" for storing OP CODEs from instructions which are not yet complete in the sense that the operands for that instruction are not yet available to the functional unit. The reservation station stores the instruction's OP CODE together with operand tags which reserve places for the missing operands that will arrive at the reservation station later. This technique enhances performance by permitting the microprocessor to continue executing other instructions while the pending instruction is being assembled together with its operands at the reservation station. As seen in FIG. 1, branch unit 90 is equipped with a reservation station 90R; ALUs 95 and 100 are equipped with reservation stations 95R and 100R, respectively; shifter unit 105 is equipped with a reservation station 105R; load unit 60 is equipped with a reservation station 60R; and store unit 65 is equipped with a reservation station 65R. In this approach, reservation stations are employed in place of the input latches which were typically used at the inputs of the functional units in earlier microprocessors. The classic reference with respect to reservation stations is R. M. Tomasulo, "An Efficient Algorithm For Exploiting Multiple Arithmetic Units", IBM Journal, Volume 11, January 1967, pp. 25-33.
As mentioned earlier, a pipeline can be used to increase the effective throughput in a scalar microprocessor up to a limit of one instruction per machine cycle. In the superscalar microprocessor shown in FIG. 1, multiple pipelines are employed to achieve the processing of multiple instructions per machine cycle. This technique is referred to as "super-pipelining".
Another technique referred to as "register renaming" can also be employed to enhance superscalar microprocessor throughput. This technique is useful in the situation where two instructions in an instruction stream both require use of the same register, for example a hypothetical register 1. Provided that the second instruction is not dependent on the first instruction, a second register called register 1A is allocated for use by the second instruction in place of register 1. In this manner, the second instruction can be executed and a result can be obtained without waiting for the first instruction to be done using register 1. The superscalar microprocessor 10 shown in FIG. 1 uses a register renaming approach to increase instruction handling capability. The manner in which register renaming is implemented in microprocessor 10 is now discussed in more detail.
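For purposes of illustration, the renaming idea can be sketched in Python as follows. The sketch assumes instructions in the OP CODE, OPERAND A, OPERAND B, DESTINATION REGISTER format noted earlier; the function rename and the "R1.0", "R1.1" naming scheme are hypothetical stand-ins for the "register 1A" of the example and do not represent the hardware mechanism of microprocessor 10.

```python
def rename(instructions):
    """Give every register write a fresh name; later reads use the newest name.

    `instructions` is a list of (opcode, operand_a, operand_b, destination) tuples.
    """
    current_name = {}  # architectural register -> most recent renamed register
    write_count = {}   # architectural register -> number of writes seen so far
    renamed = []
    for opcode, operand_a, operand_b, destination in instructions:
        # Sources read whichever renamed register holds the latest value.
        operand_a = current_name.get(operand_a, operand_a)
        operand_b = current_name.get(operand_b, operand_b)
        # Each write to a destination allocates a new name.
        count = write_count.get(destination, 0)
        write_count[destination] = count + 1
        new_destination = f"{destination}.{count}"
        current_name[destination] = new_destination
        renamed.append((opcode, operand_a, operand_b, new_destination))
    return renamed


program = [
    ("ADD", "R2", "R3", "R1"),  # writes R1
    ("SUB", "R1", "R4", "R5"),  # reads the R1 written by ADD
    ("AND", "R6", "R7", "R1"),  # writes R1 again but does not depend on ADD
]
renamed = rename(program)
for instruction in renamed:
    print(instruction)
```

Here the AND instruction receives its own copy of the register (R1.1), so it need not wait for the ADD and SUB instructions to finish with R1.0.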
From the above, it is seen that register renaming eliminates storage conflicts for registers. To implement register renaming, integer unit 15 and floating point unit 20 are associated with respective reorder buffers 35 and 50. For simplicity, only register renaming via reorder buffer 35 in integer unit 15 will be discussed, although the same discussion applies to similar circuitry in floating point unit 20.
Reorder buffer 35 includes a number of storage locations which are dynamically allocated to instruction results. More specifically, when an instruction is decoded by decoder 25, the result value of the instruction is assigned a location in reorder buffer 35 and its destination register number is associated with this location. This effectively renames the destination register number of the instruction to the reorder buffer location. A tag, or temporary hardware identifier, is generated by the microprocessor hardware to identify the result. This tag is also stored in the assigned reorder buffer location. When a later instruction in the instruction stream refers to the renamed destination register, in order to obtain the value considered to be stored in the register, the instruction instead obtains the value stored in the reorder buffer or the tag for this value if the value has not yet been computed.
Reorder buffer 35 is implemented as a first-in-first-out (FIFO) circular buffer which is a content-addressable memory. This means that an entry in reorder buffer 35 is identified by specifying something that the entry contains, rather than by identifying the entry directly. More particularly, the entry is identified by using the register number that is written into it. When a register number is presented to reorder buffer 35, the reorder buffer provides the latest value written into the register (or a tag for the value if the value is not yet computed). This tag contains the relative speculative position of a particular instruction in reorder buffer 35. This organization mimics register file 30 which also provides a value in a register when it is presented with a register number. However, reorder buffer 35 and register file 30 use very different mechanisms for accessing values therein.
In the mechanism employed by reorder buffer 35, the reorder buffer compares the requested register number to the register numbers in all of the entries of the reorder buffer. Then, the reorder buffer returns the value (or tag) in the entry that has a matching register number. This is an associative lookup technique. In contrast, when register file 30 is presented with a requested register number, the register file simply decodes the register number and provides the value at the selected entry.
When instruction decoder 25 decodes an instruction, the register numbers of the decoded instruction's source operands are used to access both reorder buffer 35 and register file 30 at the same time. If reorder buffer 35 does not have an entry whose register number matches the requested source register number, then the value in register file 30 is selected as the source operand. However, if reorder buffer 35 does contain a matching entry, then the value in this entry is selected as the source operand because this value must be the most recent value assigned to the reorder buffer. If the value is not available because the value has not yet been computed, then the tag for the value is instead selected and used as the operand. In any case, the value or tag is copied to the reservation station of the appropriate functional unit. This procedure is carried out for each operand required by each decoded instruction.
In a typical instruction sequence, a given register may be written many times. For this reason, it is possible that different instructions cause the same register to be written into different entries of reorder buffer 35 in the case where the instructions specify the same destination register. To obtain the correct register value in this scenario, reorder buffer 35 prioritizes multiple matching entries by order of allocation, and returns the most recent entry when a particular register value is requested. By this technique, new entries to the reorder buffer supersede older entries.
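The operand lookup described above may be sketched in Python as follows. This is a simplified model offered under stated assumptions, not the circuitry of reorder buffer 35: the names RobEntry, ReorderBuffer and read_operand are hypothetical, the buffer is an unbounded list rather than a fixed-size circular FIFO, and the parallel associative compare is modeled as a newest-to-oldest scan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int                     # temporary identifier for the pending result
    dest_reg: int                # destination register number renamed to this entry
    value: Optional[int] = None  # None until the result has been computed


class ReorderBuffer:
    def __init__(self):
        self.entries = []  # kept in allocation order, oldest first
        self.next_tag = 0

    def allocate(self, dest_reg):
        """Assign a new entry to an instruction's result at decode time."""
        entry = RobEntry(tag=self.next_tag, dest_reg=dest_reg)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def lookup(self, reg):
        """Return the most recently allocated entry for `reg`, or None."""
        for entry in reversed(self.entries):
            if entry.dest_reg == reg:
                return entry
        return None


def read_operand(reg, rob, register_file):
    """Select a source operand: a value if available, otherwise a tag."""
    entry = rob.lookup(reg)
    if entry is None:
        return ("value", register_file[reg])  # no match: use the register file
    if entry.value is not None:
        return ("value", entry.value)         # match with a computed result
    return ("tag", entry.tag)                 # match, result still pending


register_file = {1: 10, 2: 20}
rob = ReorderBuffer()
rob.allocate(dest_reg=1)
rob.entries[0].value = 99   # the older write to register 1 has completed
rob.allocate(dest_reg=1)    # a newer write to register 1 is still pending
newest = read_operand(1, rob, register_file)
untouched = read_operand(2, rob, register_file)
```

In this example the read of register 1 yields the tag of the newer, pending entry rather than the older value 99, and the read of register 2 falls through to the register file.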
When a functional unit produces a result, the result is written into reorder buffer 35 and to any reservation station entry containing a tag for this result. When a result value is written into the reservation stations in this manner, it may provide a needed operand which frees up one or more waiting instructions to be issued to the functional unit for execution. After the result value is written into reorder buffer 35, subsequent instructions continue to fetch the result value from the reorder buffer. This fetching continues unless the entry is superseded by a new value and until the value is retired by writing the value to register file 30. Retiring occurs in the order of the original instruction sequence, thus preserving the in-order state for interrupts and exceptions.
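Result writeback and in-order retirement, as described above, may be sketched in Python as follows. The dictionary-based entries and the names write_result and retire are hypothetical; the retire limit of two results per cycle is an assumption consistent with the figures given below for the reorder buffers.

```python
from collections import deque


def write_result(rob, stations, tag, value):
    """Write a result to its reorder buffer entry and to any station waiting on its tag."""
    for entry in rob:
        if entry["tag"] == tag:
            entry["value"] = value
    for station in stations:
        station["operands"] = [
            ("value", value) if operand == ("tag", tag) else operand
            for operand in station["operands"]
        ]


def retire(rob, register_file, max_per_cycle=2):
    """Retire completed results from the head of the buffer, in program order."""
    retired = 0
    while rob and rob[0]["value"] is not None and retired < max_per_cycle:
        entry = rob.popleft()
        register_file[entry["dest_reg"]] = entry["value"]
        retired += 1
    return retired


rob = deque([
    {"tag": 0, "dest_reg": 1, "value": None},  # older instruction
    {"tag": 1, "dest_reg": 2, "value": None},  # younger instruction
])
stations = [{"opcode": "ADD", "operands": [("tag", 1), ("value", 5)]}]
register_file = {1: 0, 2: 0}

write_result(rob, stations, tag=1, value=7)  # the younger result arrives first
retired_early = retire(rob, register_file)   # nothing retires: the head is incomplete
write_result(rob, stations, tag=0, value=3)
retired_late = retire(rob, register_file)    # both results now retire in order
```

Although the younger instruction completes first, nothing is written to the register file until the older instruction at the head of the buffer has also completed, which is how the in-order state is preserved in this sketch.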
With respect to floating point unit 20, it is noted that in addition to the float load functional unit 75 and a float store functional unit 80, floating point unit 20 includes other functional units as well. For instance, floating point unit 20 includes a float add unit 120, a float convert unit 125, a float multiply unit 130 and a float divide unit 140. An OP CODE bus 145 is coupled between decoder 40 and each of the functional units in floating point unit 20 to provide decoded instructions to the functional units. Each functional unit includes a respective reservation station, namely, float add reservation station 120R, float convert reservation station 125R, float multiply reservation station 130R and float divide reservation station 140R. An operand bus 150 couples register file 45 and reorder buffer 50 to the reservation stations of the functional units so that operands are provided thereto. A result bus 155 couples the outputs of all of the functional units of floating point unit 20 to reorder buffer 50. Reorder buffer 50 is then coupled to register file 45. Reorder buffer 50 and register file 45 are thus provided with results in the same manner as discussed above with respect to integer unit 15.
Integer reorder buffer 35 holds 16 entries and floating point reorder buffer 50 holds 8 entries. Integer reorder buffer 35 and floating point reorder buffer 50 can each accept two computed results per machine cycle and can retire two results per cycle to the respective register file.
When a microprocessor is constrained to issue decoded instructions in order ("in-order issue"), the microprocessor must stop decoding instructions whenever a decoded instruction generates a resource conflict (i.e., two instructions both wanting to use the R1 register) or when the decoded instruction has a dependency. In contrast, microprocessor 10 of FIG. 1, which employs "out-of-order issue", achieves this type of instruction issue by isolating decoder 25 from the execution units (functional units). This is done by using reorder buffer 35 and the aforementioned reservation stations at the functional units to effectively establish a distributed instruction window. In this manner, the decoder can continue to decode instructions even if the instructions cannot be immediately executed. The instruction window acts as a pool of instructions from which the microprocessor can draw as it continues to go forward and execute instructions. A look-ahead capability is thus provided to the microprocessor by the instruction window. When dependencies are cleared up and as operands become available, more instructions in the window are executed by the functional units and the decoder continues to fill the window with yet more decoded instructions.
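The benefit of the instruction window can be illustrated with a small Python simulation. The model is a deliberate simplification and an assumption of this sketch: at most one instruction issues per cycle, each destination register is written once (as after renaming), every producer precedes its consumers, and the function name cycles_to_finish and the latencies are hypothetical.

```python
import math


def cycles_to_finish(program, out_of_order):
    """Return the cycle in which the last result becomes available.

    `program` is a list of (destination, sources, latency) tuples.
    """
    # Results of instructions not yet issued are unavailable; other registers are ready.
    ready_at = {dest: math.inf for dest, _, _ in program}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        cycle += 1
        # In-order issue may only consider the oldest pending instruction;
        # out-of-order issue may pick any instruction in the window.
        candidates = pending if out_of_order else pending[:1]
        chosen = next(
            (ins for ins in candidates
             if all(ready_at.get(src, 0) <= cycle for src in ins[1])),
            None,
        )
        if chosen is not None:
            dest, _, latency = chosen
            ready_at[dest] = cycle + latency
            finish = max(finish, cycle + latency)
            pending.remove(chosen)
    return finish


program = [
    ("R1", ("R0",), 3),  # long-latency operation
    ("R2", ("R1",), 1),  # depends on R1
    ("R3", ("R0",), 1),  # independent of R1
    ("R4", ("R3",), 1),  # depends on R3
]
in_order_cycles = cycles_to_finish(program, out_of_order=False)
out_of_order_cycles = cycles_to_finish(program, out_of_order=True)
print(in_order_cycles, out_of_order_cycles)
```

Under these assumptions, the in-order model stalls behind the instruction that waits for R1, whereas the out-of-order model issues the independent instructions in the meantime and finishes sooner.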
Microprocessor 10 includes a branch prediction unit 90 to enhance its performance. It is well known that branches in the instruction stream of a program hinder the capability of a microprocessor to fetch instructions. This is so because when a branch occurs, the next instruction which the fetcher should fetch depends on the result of the branch. Without a branch prediction unit such as unit 90, the microprocessor's instruction fetcher may become stalled or may fetch incorrect instructions. This reduces the likelihood that the microprocessor can find other instructions in the instruction window to execute in parallel. Hardware branch prediction, as opposed to software branch prediction, is employed in branch prediction unit 90 to predict the outcomes of branches which occur during instruction fetching. In other words, branch prediction unit 90 predicts whether or not branches should be taken. For example, a branch target buffer is employed to keep a running history of the outcomes of prior branches. Based on this history, a decision is made during a particular fetched branch as to which branch the fetched branch instruction will take.
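A history-based prediction of this kind may be sketched in Python as follows. The two-bit saturating counter used here is one common scheme and is an assumption of this sketch; the text above does not state which scheme branch prediction unit 90 uses, and the class name HistoryPredictor and the branch address 0x40 are hypothetical.

```python
class HistoryPredictor:
    """Per-branch two-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in the range 0..3

    def predict(self, address):
        # Branches never seen before start at 1 (weakly not taken).
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        counter = self.counters.get(address, 1)
        self.counters[address] = min(counter + 1, 3) if taken else max(counter - 1, 0)


predictor = HistoryPredictor()
outcomes = [True] * 9 + [False]  # a loop branch: taken nine times, then falls through
correct = 0
for taken in outcomes:
    if predictor.predict(0x40) == taken:
        correct += 1
    predictor.update(0x40, taken)
print(correct, "of", len(outcomes), "predicted correctly")
```

For this loop-like pattern the sketch mispredicts only the first iteration and the final exit.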
It is noted that software branch prediction also may be employed to predict the outcome of a branch. In that branch prediction approach, several tests are run on each branch in a program to determine statistically which branch outcome is more likely. Software branch prediction techniques typically involve imbedding statistical branch prediction information as to the favored branch outcome in the program itself. It is noted that the term "speculative execution" is often applied to microprocessor design practices wherein a sequence of code (such as a branch) is executed before the microprocessor is sure that it was proper to execute that sequence of code.
To understand the operation of superscalar microprocessors, it is helpful to compare scalar and superscalar microprocessors at each stage of the pipeline, namely at fetch, decode, execute, writeback and result commit. Table 1 below provides such a comparison.
TABLE 1
______________________________________________________________________________
                                               Pipelined Superscalar Processor
                 Pipelined                     (with out-of-order issue &
Pipeline Stage   Scalar Processor              out-of-order completion)
______________________________________________________________________________
Fetch            fetch one instruction         fetch multiple instructions

Decode           decode instruction            decode instructions
                 access operands from          access operands from register
                 register file                 file and reorder buffer
                 copy operands to functional   copy operands to functional
                 unit input latches            unit reservation stations

Execute          execute instruction           execute instructions
                                               arbitrate for result buses

Writeback        write result to register      write results to reorder
                 file                          buffer
                 forward results to            forward results to functional
                 functional unit input         unit reservation stations
                 latches

Result Commit    n/a                           write result to register file
______________________________________________________________________________
From the above description of superscalar microprocessor 10, it is appreciated that this microprocessor is indeed a powerful but very complex structure. Further increases in processing performance as well as design simplification are however always desirable in microprocessors such as microprocessor 10.