1. Field of the Invention
The present invention relates to the field of electronic circuits. More specifically, embodiments of the present invention relate to Arithmetic Logic Units (ALUs), and in particular, ALUs included in a pipelined processor.
2. Description of the Related Art
Users of data processing systems such as computers and the like continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. Greater performance from the processors that operate such systems may be obtained through faster clock speeds, so that individual instructions are processed more quickly. However, relatively greater performance gains have been achieved through performing multiple operations in parallel with one another.
One manner of parallelization is known as “pipelining”, where instructions are fed into a pipeline for an execution unit in a processor that performs different operations necessary to process the instructions in parallel. For example, to process a typical instruction, a pipeline may include separate stages for fetching the instruction from memory, executing the instruction, and writing the results of the instruction back into memory. Thus, for a sequence of instructions fed in sequence into the pipeline, as the results of the first instruction are being written back into memory by the third stage of the pipeline, a next instruction is being executed by the second stage, and still a next instruction is being fetched by the first stage. While each individual instruction may take several clock cycles to be processed, since other instructions are also being processed at the same time, the overall throughput of the processor is much greater. With respect to pipelining, the term “stage” generally refers to the combinational logic between registers or latches.
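The overlap described above can be illustrated with a minimal sketch of an ideal, stall-free three-stage pipeline. The function names and the one-cycle-per-stage timing are assumptions made for illustration only and are not taken from any particular processor.

```python
def pipeline_schedule(num_instructions, stages=("fetch", "execute", "writeback")):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline.

    Assumes every stage takes exactly one clock cycle and nothing stalls.
    """
    schedule = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(stages):
            cycle = instr + depth  # instruction i occupies stage s in cycle i + s
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(num_instructions, num_stages=3):
    """Cycles needed to finish all instructions: fill once, then one per cycle."""
    return num_stages + num_instructions - 1


schedule = pipeline_schedule(4)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

In this sketch, four instructions complete in six cycles rather than the twelve that would be needed if each instruction had to finish all three stages before the next one began.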
Pipelining is the placing of logic between various types of memories. Known memories include registers, latches, and Random Access Memory (RAM). A register is a type of word-based memory that stores a set of bits, and generally, all the bits are written in parallel on the edge of a clock or similar event in time. A latch is a type of word-based memory that stores a set of bits, and generally, the bits are stored while an enable signal is active, thereby allowing input changes to propagate to outputs while the enable signal is active. A latch is sometimes called a “half-register”. Putting logic between half-registers has the advantage of partial cycle stealing from a prior stage, and can reduce the cycle time of a pipelined circuit. Random Access Memory (RAM) is an array-based memory that stores a plurality of words, each word being a set of bits. RAMs can have a plurality of access ports, thereby allowing multiple reads and/or writes from/to the RAM. Fast RAM, generally with multiple access ports, is sometimes called a register file.
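The difference between a register and a latch can be sketched behaviorally as follows. The class names and the step-based interface are hypothetical and serve only to contrast edge-triggered storage with level-sensitive storage.

```python
class Register:
    """Edge-triggered storage: captures the input only on a rising clock edge."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def step(self, clk, d):
        if clk == 1 and self._prev_clk == 0:  # rising edge detected
            self.q = d
        self._prev_clk = clk
        return self.q


class Latch:
    """Level-sensitive storage: transparent while the enable signal is active."""

    def __init__(self):
        self.q = 0

    def step(self, enable, d):
        if enable:  # input changes propagate to the output while enabled
            self.q = d
        return self.q


clk = [0, 1, 1, 0]
data = [5, 6, 7, 8]
reg, lat = Register(), Latch()
reg_out = [reg.step(c, d) for c, d in zip(clk, data)]
lat_out = [lat.step(c, d) for c, d in zip(clk, data)]
print(reg_out, lat_out)
```

With the same inputs, the register holds the value sampled at the rising edge, while the latch follows the input for as long as the enable signal stays high.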
Individual arithmetic operations, such as addition and multiplication, can also be pipelined. For example, a multiplier can be designed with four stages, and take four clock cycles to compute a result corresponding to a particular input, but accept new inputs each clock cycle. Pipelining can be applied to memories as well. For example, a memory could have the following stages: address decode; memory array access; and data output. A pipelined circuit can be composed of many stages, and include a plurality of memory, arithmetic, and logic circuits.
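The four-stage multiplier example can be modeled with a short behavioral sketch. The class name and interface are hypothetical, and the whole product is computed when the operands leave the last stage, which is a simplification: real hardware would do part of the work in each stage.

```python
from collections import deque


class PipelinedMultiplier:
    """Behavioral model: accepts one operand pair per clock, four-cycle latency."""

    def __init__(self, stages=4):
        # Each slot holds the operand pair currently in that stage (None = bubble).
        self.pipe = deque([None] * stages, maxlen=stages)

    def clock(self, operands=None):
        """Advance one clock cycle and return the product in the last stage, if any."""
        self.pipe.appendleft(operands)  # oldest entry falls off the far end
        done = self.pipe[-1]
        return None if done is None else done[0] * done[1]


mul = PipelinedMultiplier()
inputs = [(2, 3), (4, 5), (6, 7), None, None, None]  # three inputs, then drain
outputs = [mul.clock(pair) for pair in inputs]
print(outputs)
```

The first product appears on the fourth clock, and one further product appears on each clock after that, even though every individual multiplication still takes four cycles.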
Greater parallelization can also be performed by attempting to execute multiple instructions in parallel using multiple pipelined execution units in a processor. Processors that include multiple execution units are often referred to as “superscalar” processors, and such processors include scheduling circuitry that attempts to efficiently dispatch instructions to different execution units so that as many instructions are processed at the same time as possible. Relatively complex decision-making circuitry is often required, however, because oftentimes one instruction cannot be processed until after another instruction is completed. For example, if a first instruction loads a register with a value from memory, and a second instruction adds a fixed number to the contents of the register, the second instruction typically cannot be executed until execution of the first instruction is complete.
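The load-then-add example is a case of the second instruction reading a register that the first instruction writes. A minimal sketch of that check follows; the dictionary encoding of an instruction and the register names are assumptions made for illustration, and real scheduling circuitry tracks considerably more state.

```python
def has_raw_dependency(earlier, later):
    """True if `later` reads a register that `earlier` writes."""
    return earlier["dest"] in later["sources"]


# Hypothetical instruction encodings, for illustration only.
load = {"op": "load", "dest": "r1", "sources": ["mem"]}
add = {"op": "addi", "dest": "r1", "sources": ["r1"]}
unrelated = {"op": "addi", "dest": "r3", "sources": ["r2"]}

print(has_raw_dependency(load, add))        # the add must wait for the load
print(has_raw_dependency(load, unrelated))  # these two may issue together
```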
The use of relatively complex scheduling circuitry can occupy a significant amount of area on an integrated circuit device, and can slow the overall execution speed of a processor. For these reasons, significant development work has been devoted to Very Long Instruction Word (VLIW) processors, where the decision as to which instructions can be executed in parallel is made when a program is created, rather than during execution. A VLIW processor typically includes multiple pipelined execution units, and each VLIW instruction includes multiple primitive instructions known as parcels that are known to be executable at the same time as one another. Each primitive instruction in a VLIW instruction may therefore be directly dispatched to one of the execution units without the extra overhead associated with scheduling. VLIW processors rely on sophisticated computer programs known as compilers to generate suitable VLIW instructions for a computer program written by a computer user. VLIW processors are typically less complex and more efficient than superscalar processors given the elimination of the overhead associated with scheduling the execution of instructions.
It is common practice for pipelined logic to be synchronously clocked. That is, a single timebase clocks the entire circuit. Alternatively, various portions of the pipelined logic can be clocked with different timebases (i.e., different frequencies), and these different timebases are usually (although not necessarily) rational number multiples of each other, thereby allowing them to be derived from a single frequency source. In the case of asynchronous circuits, there can be multiple timebases that are asynchronous to one another. It is also possible for registers to be clocked by detecting when the computation of input data is complete (i.e., self-timed circuits), resulting in fully asynchronous behavior.
One design consideration in pipelined circuits is stalling. Stalling occurs in a pipelined circuit when at least one stage waits for some data. A simple example is an execution unit waiting until a next instruction is available; that is, the execution unit stalls. In general, mitigation of stalls is desirable, but it is the average sustainable performance that is maximized in most designs. Hence, introducing infrequent stalls in order to increase overall performance can be a fruitful design choice.
Another design consideration in pipelined circuits is the critical path. The critical path is the path through a circuit that takes the longest time to propagate from input to output. The critical path determines the smallest allowable clock period, and the smaller the clock period, the higher the performance. Accordingly, the performance is inversely related to the clock period. In pipelined circuits, this critical path is measured from register-to-register or latch-to-latch (or between any two of the various types of memory circuits).
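The relationship between the critical path and the clock period can be sketched numerically. The stage delays below are hypothetical values chosen for illustration, and the sketch ignores register setup time, clock skew, and similar overheads.

```python
def min_clock_period_ns(stage_delays_ns):
    """The slowest register-to-register path sets the smallest usable period."""
    return max(stage_delays_ns)


def max_frequency_mhz(stage_delays_ns):
    """Frequency is the reciprocal of the period (1 / ns = 1000 MHz)."""
    return 1000.0 / min_clock_period_ns(stage_delays_ns)


# Hypothetical delays for three pipeline stages, in nanoseconds.
delays = [0.8, 1.25, 1.0]
print(min_clock_period_ns(delays), max_frequency_mhz(delays))
```

In this sketch the 1.25 ns stage is the critical path. Shortening either of the other two stages would not allow a faster clock.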
One possible critical path in pipelined logic is through an ALU. A typical ALU performs at least some of the following operations: add, shift, rotate, AND, OR, NAND, NOR, and the like. Generally, the critical path through an ALU occurs for the add operation, primarily due to an arithmetic carry through all the bits. An arithmetic carry is the “carry out” from a bit position into the next most significant bit. For example, in an 8-bit adder, adding the bit patterns ‘01111111’ and ‘01111111’ causes arithmetic carries to propagate through all the bit positions. A simple type of adder allows carry values to ripple from the least significant bit to the most significant bit, but this is slow due to a long critical path. More sophisticated adders use a carry-look-ahead circuit to generate carry values. But, even for carry-look-ahead circuits, wider (i.e., more bit position) adders have a longer critical path.
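The ripple behavior of a simple adder can be sketched bit by bit. The function name and return convention are assumptions made for illustration. The loop mirrors the hardware dependency: each bit position needs the carry from the position below it before it can produce its own.

```python
def ripple_carry_add(a, b, width=8, carry_in=0):
    """Add two width-bit values one bit at a time.

    Returns (sum, carry_out, carries), where carries[i] is the carry
    out of bit position i into position i + 1.
    """
    carry = carry_in
    total = 0
    carries = []
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        total |= (a_bit ^ b_bit ^ carry) << i
        # Carry out: generated by this position, or passed on from below.
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        carries.append(carry)
    return total, carry, carries


total, carry_out, carries = ripple_carry_add(0b01111111, 0b01111111)
print(format(total, "08b"), carry_out, carries)
```

For the bit patterns in the example above, a carry leaves each of the seven low-order positions, so the most significant sum bit cannot settle until the carry has travelled the full width of the adder.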
Recently, microprocessor architectures have been extended from 32-bit architectures to 64-bit architectures. This change increases the width of the ALU, increasing the critical path delay through the ALU (e.g., by increasing the number of bits for the carry-look-ahead logic) and reducing performance. Hence, it is desirable to reduce the critical path through 64-bit ALUs.
Prior art solutions split each of the ALU's two 64-bit operands (A and B) into two 32-bit operands by separating the high order bits from the low order bits (respectively, AH and AL, and BH and BL). AL and BL are fed into a first 32-bit adder, and AH and BH are fed into both a second adder and a third adder. The second adder has its carry input set to “0”, while the third adder has its carry input set to “1”. The output sum from the first adder forms the lower bits of the result. The carry out from the first adder selects (by a multiplexer) between the output sum from the second adder and the output sum from the third adder to form the higher bits of the result. The three adders operate in parallel. Three adders are used so that the addition of the upper bits can start before the carry output from the low order bits is determined. While the critical path is reduced, the disadvantages are: additional area due to three adders (rather than two); circuit delay due to the multiplexer; and additional power consumption due to additional switching of circuits.
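The three-adder arrangement described above can be sketched functionally as follows. The function names are hypothetical, the carry input of the first adder is assumed to be “0”, and Python's integer addition stands in for each 32-bit adder. The sketch shows only the data flow, not the timing benefit, since the three additions run one after another in software.

```python
MASK32 = (1 << 32) - 1


def add32(a, b, carry_in):
    """Stand-in for one 32-bit adder: returns (32-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32


def carry_select_add64(a, b):
    """Add two 64-bit values using three 32-bit adders and a multiplexer."""
    al, ah = a & MASK32, (a >> 32) & MASK32
    bl, bh = b & MASK32, (b >> 32) & MASK32

    low_sum, low_carry = add32(al, bl, 0)  # first adder: low order bits
    high0, cout0 = add32(ah, bh, 0)        # second adder: carry input "0"
    high1, cout1 = add32(ah, bh, 1)        # third adder: carry input "1"

    # Multiplexer: the carry out of the first adder selects the high half.
    high_sum, carry_out = (high1, cout1) if low_carry else (high0, cout0)
    return (high_sum << 32) | low_sum, carry_out


print(hex(carry_select_add64(0x00000001FFFFFFFF, 1)[0]))
```

In hardware the second and third adders do the same work on the same operands and one of the two results is always discarded, which is the area and power cost noted above.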
In the context of an ALU, forwarding of an arithmetic result refers to making use of the result in the next clock cycle. For example, if X+Y+Z is to be computed, then X+Y is computed in a first arithmetic unit and forwarded to a next arithmetic unit that adds the Z value. Forwarding must be done in an efficient manner, without increasing the critical path or causing the pipeline to stall frequently. Forwarding is generally done by routing the wires of a data bus from the first arithmetic unit to the second. In the prior art, in order to reduce the critical path delay, wide metal wires are used for forwarding, but this has the significant disadvantages of: consuming numerous routing tracks (e.g., 4× width wires instead of the smallest 1× width wires); and requiring larger buffers to drive a larger wiring load, thereby increasing area and power consumption.
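The X+Y+Z example can be sketched as a two-cycle data flow. The function name, the trace format, and the 64-bit default width are assumptions made for illustration. The sketch models only which unit produces which value in which cycle, not the wiring or the buffers.

```python
def forwarded_sum(x, y, z, width=64):
    """Compute X+Y+Z across two arithmetic units with result forwarding.

    Returns (result, trace), where trace lists (cycle, unit, value).
    """
    mask = (1 << width) - 1
    trace = []

    partial = (x + y) & mask      # cycle 1: first unit computes X+Y
    trace.append((1, "unit 1", partial))

    result = (partial + z) & mask  # cycle 2: second unit adds Z to the forwarded value
    trace.append((2, "unit 2", result))
    return result, trace


result, trace = forwarded_sum(10, 20, 30)
print(result, trace)
```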
Therefore, there is a need for a fast forwarding ALU that overcomes the deficiencies described above.