1. Field of the Invention
The present invention generally relates to computer systems, particularly microprocessors having execution units such as fixed-point units or floating point units, and more specifically to an arithmetic logic unit which carries out addition and subtraction operations.
2. Description of the Related Art
High-performance computer systems typically use multiple processors to carry out the various program instructions embodied in computer programs such as software applications and operating systems. A conventional microprocessor design is illustrated in FIG. 1. Processor 10 is generally comprised of a single integrated circuit superscalar microprocessor, and includes various execution units, registers, buffers, memories, and other functional units which are all formed by integrated circuitry. Processor 10 may operate according to reduced instruction set computing (RISC) techniques, and is coupled to a system or fabric bus 12 via a bus interface unit (BIU) 14 within processor 10. BIU 14 controls the transfer of information between processor 10 and other devices coupled to system bus 12, such as a main memory, by participating in bus arbitration. Processor 10, system bus 12, and the other devices coupled to system bus 12 together form a host data processing system.
BIU 14 is connected to an instruction cache and memory management unit (MMU) 16, and to a data cache and MMU 18 within processor 10. High-speed caches, such as those within instruction cache and MMU 16 and data cache and MMU 18, enable processor 10 to achieve relatively fast access time to a subset of data or instructions previously transferred from main memory to the caches, thus improving the speed of operation of the host data processing system. Instruction cache and MMU 16 is further coupled to a sequential fetcher 20, which fetches instructions for execution from instruction cache and MMU 16 during each cycle. Sequential fetcher 20 transmits branch instructions fetched from instruction cache and MMU 16 to a branch prediction unit 22 for calculating the next instruction fetch address, but temporarily stores sequential instructions within an instruction queue 24 for execution by other execution circuitry within processor 10.
The execution circuitry of processor 10 has multiple execution units for executing sequential instructions, including one or more fixed-point units (FXUs) 26, load-store units (LSUs) 28, floating-point units (FPUs) 30, and branch processing units (BPUs) 32. These execution units 26, 28, 30, and 32 each execute one or more sequential instructions of a particular type during each processor cycle. For example, FXUs 26 perform fixed-point mathematical and logical operations such as addition, subtraction, shifts, rotates, and XORing, utilizing source operands received from specified general purpose registers (GPRs) 34 or GPR rename buffers 36. Following the execution of a fixed-point instruction, FXUs 26 output the data results of the instruction to GPR rename buffers 36, which provide temporary storage for the result data until the instruction is completed by transferring the result data from GPR rename buffers 36 to one or more of GPRs 34. FPUs 30 perform single and double-precision floating-point arithmetic and logical operations, such as floating-point multiplication and division, on source operands received from floating-point registers (FPRs) 38 or FPR rename buffers 40. FPUs 30 output data resulting from the execution of floating-point instructions to selected FPR rename buffers 40, which temporarily store the result data until the instructions are completed by transferring the result data from FPR rename buffers 40 to selected FPRs 38. LSUs 28 execute floating-point and fixed-point instructions which either load data from memory (i.e., either the data cache within data cache and MMU 18 or main memory) into selected GPRs 34 or FPRs 38, or which store data from a selected one of GPRs 34, GPR rename buffers 36, FPRs 38, or FPR rename buffers 40 to system memory. BPUs 32 perform condition code manipulation instructions and branch instructions.
Processor 10 may employ both pipelining and out-of-order execution of instructions to further improve the performance of its superscalar architecture, but the present invention is particularly advantageous when used with in-order program execution or in cases where out-of-order execution capabilities are limited. For out-of-order processing, instructions can be executed by FXUs 26, LSUs 28, FPUs 30, and BPUs 32 in any order as long as data dependencies are observed. In addition, instructions may be processed by each of the FXUs 26, LSUs 28, FPUs 30, and BPUs 32 at a sequence of pipeline stages, in particular, five distinct pipeline stages: fetch, decode/dispatch, execute, finish, and completion.
During the fetch stage, sequential fetcher 20 retrieves one or more instructions associated with one or more memory addresses from instruction cache and MMU 16. Sequential instructions fetched from instruction cache and MMU 16 are stored by sequential fetcher 20 within instruction queue 24. Sequential fetcher 20 folds out branch instructions from the instruction stream and forwards them to branch prediction unit 22 for handling. Branch prediction unit 22 includes a branch prediction mechanism, which may comprise a dynamic prediction mechanism such as a branch history table, that enables branch prediction unit 22 to speculatively execute unresolved conditional branch instructions by predicting whether or not the branch will be taken.
During the decode/dispatch stage, instruction dispatch unit (IDU) 42 decodes and dispatches one or more instructions from instruction queue 24 to execution units 26, 28, 30, and 32. In addition, dispatch unit 42 allocates a rename buffer within GPR rename buffers 36 or FPR rename buffers 40 for each dispatched instruction's result data. Upon dispatch, instructions are also stored within the multiple-slot completion buffer of completion unit 44 to await completion. Processor 10 tracks the program order of the dispatched instructions during out-of-order execution utilizing unique instruction identifiers.
During the execute stage, execution units 26, 28, 30, and 32 execute instructions received from dispatch unit 42 opportunistically as operands and execution resources for the indicated operations become available. Each of execution units 26, 28, 30, and 32 is preferably equipped with a reservation station that stores instructions dispatched to that execution unit until operands or execution resources become available. After execution of an instruction has terminated, execution units 26, 28, 30, and 32 store data results, if any, within either GPR rename buffers 36 or FPR rename buffers 40, depending upon the instruction type. Then, execution units 26, 28, 30, and 32 notify completion unit 44 which instructions have finished execution. Finally, instructions are completed in program order out of the completion buffer of completion unit 44. Instructions executed by FXUs 26 and FPUs 30 are completed by transferring data results of the instructions from GPR rename buffers 36 and FPR rename buffers 40 to GPRs 34 and FPRs 38, respectively. Load and store instructions executed by LSUs 28 are completed by transferring the finished instructions to a completed store queue or a completed load queue from which the indicated load/store operations will be performed.
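By way of illustration only, the relationship between out-of-order finishing and in-order completion described above may be sketched in software as follows. The class name CompletionBuffer, its methods, and the use of small integers as instruction identifiers are hypothetical and are not taken from any actual processor design; the sketch models only the ordering behavior, not rename buffers or queues.

```python
from collections import OrderedDict


class CompletionBuffer:
    """Toy model: instructions may finish in any order but complete in program order."""

    def __init__(self):
        # Maps instruction identifier -> finished flag, kept in dispatch (program) order.
        self._entries = OrderedDict()

    def dispatch(self, instr_id):
        """Allocate a slot for a newly dispatched instruction."""
        self._entries[instr_id] = False

    def finish(self, instr_id):
        """Record that an execution unit has finished the instruction."""
        self._entries[instr_id] = True

    def complete(self):
        """Retire the oldest finished instructions; stop at the first unfinished one."""
        retired = []
        while self._entries:
            oldest_id, finished = next(iter(self._entries.items()))
            if not finished:
                break
            self._entries.popitem(last=False)
            retired.append(oldest_id)
        return retired


buf = CompletionBuffer()
for instr_id in (0, 1, 2):
    buf.dispatch(instr_id)

buf.finish(2)            # youngest instruction finishes first
first = buf.complete()   # nothing can complete: instruction 0 is still pending
buf.finish(0)
second = buf.complete()  # instruction 0 completes; instruction 1 blocks instruction 2
buf.finish(1)
third = buf.complete()   # instructions 1 and 2 complete together, in program order
```

In this sketch, an instruction that has finished early remains in the buffer until every older instruction has also finished, which is the property that preserves program order at completion.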
During the processing of program instructions, it is common to have a situation wherein the results of one operation are needed for the next instruction as an operand, in back-to-back cycles. This situation may be understood with reference to the following example of two instructions, an add operation followed by a subtract operation:
add r3, r1, r2
subf r5, r3, r4.
In the first instruction, the values in registers 1 and 2 (r1 and r2) are added and the sum is loaded into register 3 (r3). In the second instruction, the value in register 3 (r3) is subtracted from the value in register 4 (r4) and the difference is loaded into register 5 (r5). These instructions may be executed by an arithmetic logic unit (ALU) in either of the FXUs 26 or FPUs 30 of processor 10. The second instruction thus has a dependency on the first instruction, and if the first operation cannot be completed within a single cycle, the second operation must stall its execution, adversely affecting the overall performance of the processor.
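For purposes of illustration, the two instructions above and the dependency between them may be modeled in software as follows. This is a minimal sketch under stated assumptions: the tuple encoding of instructions, the helper names execute and depends_on, the 32-bit register width, and the initial register values are all hypothetical and chosen only to make the example concrete.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width (illustrative only)


def execute(instr, regs):
    """Apply one (op, dest, src_a, src_b) instruction to the register file."""
    op, dest, src_a, src_b = instr
    if op == "add":
        regs[dest] = (regs[src_a] + regs[src_b]) & MASK32
    elif op == "subf":
        # "Subtract from": the first source is subtracted from the second.
        regs[dest] = (regs[src_b] - regs[src_a]) & MASK32
    else:
        raise ValueError(f"unsupported operation: {op}")


def depends_on(later, earlier):
    """True if `later` reads the register that `earlier` writes."""
    return earlier[1] in later[2:]


program = [
    ("add", "r3", "r1", "r2"),   # r3 = r1 + r2
    ("subf", "r5", "r3", "r4"),  # r5 = r4 - r3
]

# Hypothetical starting values.
regs = {"r1": 5, "r2": 7, "r3": 0, "r4": 20, "r5": 0}
for instr in program:
    execute(instr, regs)

must_wait = depends_on(program[1], program[0])
```

With these starting values, r3 receives 12 and r5 receives 8, and depends_on reports that the subtract reads the register written by the add, which is the condition under which the second operation must wait for the first result.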
As the operating frequencies of these machines increase, it is desirable to add more levels of logic to an execution unit to further enhance computation power and overall speed. However, the traditional method of generating the needed operands for later, dependent instructions limits the number of levels of logic in a pipeline stage, given the timing constraints. FIG. 2 shows a traditional implementation for an ALU 50 wherein the true and complement of an operand are generated and multiplexed for input into the ALU. The ALU includes an adder, a rotator, and a data manipulation unit. When a first instruction completes (such as an add operation) the result is issued to a result bus 52 that is connected to one of the inputs of a first operand multiplexer 54a and a second operand multiplexer 54b. After that operation completes, the next instruction is decoded by control logic 56 to determine the type of operation in the pipeline. If the current result is to be one of the operands for the next instruction, multiplexer 54a selects the result bus for input and passes the previous result to a latch 58. Latch 58 has two outputs, one connected to a first input of another multiplexer 62, and the other connected to an inverter 60 whose output feeds the second input of multiplexer 62. In this manner, multiplexer 62 can selectively output either the true or complement of the previous result to ALU 50 responsive to the control signal from control logic 56.
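The reason a complement is needed at all follows from the two's-complement identity a - b = a + ~b + 1, so that an adder can subtract when given the inverted operand. The selection performed by multiplexer 62 may be sketched as follows. The function names, the 32-bit width, and the use of a carry-in of one to finish the subtraction are assumptions made for illustration; FIG. 2 as described does not specify these details.

```python
WIDTH = 32                # assumed operand width (illustrative only)
MASK = (1 << WIDTH) - 1


def select_operand(latched, use_complement):
    """Model of the true/complement selection: pass the latched value or its inverse."""
    return (~latched & MASK) if use_complement else latched


def alu_add(operand_a, operand_b, carry_in=0):
    """Model of the adder: a fixed-width sum with an optional carry-in."""
    return (operand_a + operand_b + carry_in) & MASK


def subtract_from(minuend, previous_result):
    """Compute minuend - previous_result as minuend + ~previous_result + 1."""
    complemented = select_operand(previous_result, use_complement=True)
    return alu_add(minuend, complemented, carry_in=1)


previous_result = alu_add(5, 7)                      # result of a preceding add
difference = subtract_from(20, previous_result)      # uses the complement path
wrapped = subtract_from(previous_result, 20)         # negative result wraps modulo 2**32
```

In this model, select_operand sits in series between the latched result and the adder, which mirrors the ordering of latch 58, inverter 60, and multiplexer 62 ahead of ALU 50.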
The operands and the control signals are both generated in the same cycle, synchronized by the E-latches, but because the design is datapath limited, ALU 50 has to wait while multiplexer 62 selects between the true and complement of the operand. This logic delay is particularly troublesome when trying to design high-frequency execution units, e.g., one gigahertz or higher. It would, therefore, be desirable to devise an alternative method of generating and multiplexing the true and complement of an operand in such a way that this latency is eliminated. It would be further advantageous if the method could make the overall datapath faster to facilitate higher frequency constructions.