This invention relates to arithmetic logic units for use in computer processors. More particularly, this invention relates to pipelined arithmetic logic units that improve the performance of such processors.
An arithmetic logic unit (hereinafter “ALU”) is one of the fundamental building blocks of a processor (e.g., for use in a computer or other electronic device). The ALU is a combinational circuit that performs a set of basic arithmetic and logic operations. These operations can be performed on one or more binary words received by the ALU. Binary words, also referred to as n-bit words, are strings of zeros and ones (e.g., “010011”). The ALU may add one binary word to another, or subtract one binary word from another, to obtain a result. The ALU may also subject one or more binary words to AND, OR, XOR (i.e., exclusive-OR), and NOT logic operations.
Arithmetic operations are performed by an arithmetic circuit in the ALU. Typically, the arithmetic circuit includes an adder, which can be constructed, for example, from a number of full-adder circuits connected in cascade. The operations performed by the adder can be selected by controlling the adder's inputs. For example, if the ALU operates on one or more control signals, those signals can instruct the arithmetic circuit to perform a specified operation (e.g., addition, subtraction, increment, or decrement).
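By way of illustration only, the following Python sketch models an adder built from cascaded full adders and shows how a control signal can select among addition, subtraction, increment, and decrement merely by changing what is presented at the adder's inputs. The function names (`full_adder`, `ripple_add`, `arithmetic_unit`), the string-valued control signal, and the 8-bit default width are assumptions made for this example and do not appear in the description above.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, carry_in=0, width=8):
    """Add two width-bit words by cascading full adders, LSB first."""
    result = 0
    carry = carry_in
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def arithmetic_unit(op, x, y=0, width=8):
    """Select an operation by controlling the adder's inputs (hypothetical control encoding)."""
    mask = (1 << width) - 1
    if op == "add":
        return ripple_add(x, y, 0, width)[0]
    if op == "sub":   # x - y computed as x + (NOT y) + 1 (two's complement)
        return ripple_add(x, ~y & mask, 1, width)[0]
    if op == "inc":   # x + 0 with the carry-in forced to 1
        return ripple_add(x, 0, 1, width)[0]
    if op == "dec":   # x + (all ones), i.e. x + (-1) in two's complement
        return ripple_add(x, mask, 0, width)[0]
    raise ValueError(f"unknown operation: {op}")
```

In this model the same cascade of full adders serves all four operations; only the second operand and the initial carry differ. Results wrap modulo 2^width, so for example an 8-bit increment of 255 yields 0.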
Note that ALUs can also include multiplier circuitry for executing arithmetic operations such as, for example, multiplication and division. However, operations executed by multiplier circuitry generally require more time to complete than operations executed by adder circuitry. Thus, multiplier circuitry can limit an ALU's performance.
Logic operations are performed by a logic circuit in the ALU. The logic circuit typically performs the above-mentioned AND, OR, XOR, and NOT operations by performing individual bit-to-bit (i.e., bitwise) operations. That is, the respective bits at each position (e.g., the least significant bits) of the n-bit words are subjected to the desired logic operation, and the resulting bits together provide a single word result. If the logic circuit performs AND, OR, XOR, and NOT operations, other known logic operations such as NAND (i.e., not AND), NOR (i.e., not OR), and XNOR (i.e., exclusive-NOR) can also be performed by the logic circuit. Logic operations performed by the logic circuit can be based on one or more control signals. Note that these control signals can be common to both the logic and arithmetic circuits.
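As a minimal sketch of the bitwise behavior described above, the following Python function applies a selected logic operation to each bit position of its operands and derives NAND, NOR, and XNOR by inverting the AND, OR, and XOR results. The function name `logic_unit`, the string-valued control signal, and the default width are hypothetical and chosen only for illustration.

```python
def logic_unit(op, a, b=0, width=8):
    """Apply a bitwise logic operation to width-bit words (hypothetical control encoding)."""
    mask = (1 << width) - 1
    if op in ("AND", "NAND"):
        result = a & b
    elif op in ("OR", "NOR"):
        result = a | b
    elif op in ("XOR", "XNOR"):
        result = a ^ b
    elif op == "NOT":
        result = a          # inverted below; the second operand is ignored
    else:
        raise ValueError(f"unknown operation: {op}")
    # NOT, NAND, NOR, and XNOR are the complements of the results above.
    if op in ("NOT", "NAND", "NOR", "XNOR"):
        result = ~result
    return result & mask    # keep only the low `width` bits
```

For example, with 4-bit operands 1100 and 1010, AND gives 1000 and NAND gives its complement, 0111. Each result bit depends only on the operand bits at the same position, so no signal propagates between bit positions as a carry does in an adder.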
In conventional processors, the speed at which operations are performed by the ALU is often limited by the arithmetic circuit. In particular, in single-cycle execution processors, the speed of the processor is limited by the adder circuitry of the arithmetic circuit. A single-clock-cycle ALU can generate results and flags useable in the immediately following clock cycle, thus achieving a throughput of one operation per clock cycle.
The performance of single-cycle ALUs is generally not limited by the logic circuit because logic operations execute quickly, at least with respect to arithmetic operations. The logic circuit generally has a logic depth of just one gate (e.g., an AND gate) that data signals need to traverse in order to perform the desired operation. The arithmetic circuit, however, often has a depth greater than that of the logic circuit. Therefore, the more complex adder data paths often limit ALU performance.
To compensate for the delay caused by the adder, ALU operations can be pipelined. Pipelining increases the speed of the processor and can be accomplished, for example, by inserting one or more registers in the adder data paths. The addition of a register improves the processing speed of the ALU by shortening the cycle time (less time is required for data to reach the inserted register(s)). However, arithmetic operations now require two or more clock cycles to complete. Even though pipelining causes some operations to require more than one clock cycle, operations can still be processed more quickly than in a single-cycle ALU. The concept behind pipelining is analogous to an assembly line process. In an assembly line process, construction of, for example, a large article of manufacture is performed by assembling in parallel subassemblies of the finished article. The article is built relatively quickly because the subassemblies are constructed separately and substantially, or at least partially, simultaneously before being combined to produce the finished article. Similarly, in pipelined arithmetic operations, various components of the final result are computed in a similar manner and are combined at the end of the arithmetic operation.
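The effect of inserting a register in the adder data path can be sketched, under stated assumptions, with the following behavioral Python model. It assumes one particular split, which the description above does not specify: the first stage adds the low halves of the operands, a pipeline register holds that partial sum and its carry, and the second stage adds the high halves in the following clock cycle. The class name `TwoStageAdder`, its `clock` method, and the even 16-bit default width are hypothetical.

```python
class TwoStageAdder:
    """Behavioral model of an adder split into two stages by one pipeline register."""

    def __init__(self, width=16):
        self.half = width // 2                  # assumes an even width
        self.half_mask = (1 << self.half) - 1
        self.pipe_reg = None                    # register between stage 1 and stage 2

    def clock(self, operands=None):
        """Advance one clock cycle.

        `operands` is an (x, y) pair entering stage 1, or None for no new operation.
        Returns the sum leaving stage 2 in this cycle, or None if stage 2 was empty.
        """
        # Stage 2: complete the operation latched during the previous cycle.
        result = None
        if self.pipe_reg is not None:
            low_sum, carry, x_high, y_high = self.pipe_reg
            high_sum = (x_high + y_high + carry) & self.half_mask
            result = (high_sum << self.half) | low_sum

        # Stage 1: add the low halves of the new operands and latch the partial result.
        if operands is None:
            self.pipe_reg = None
        else:
            x, y = operands
            low = (x & self.half_mask) + (y & self.half_mask)
            self.pipe_reg = (
                low & self.half_mask,           # low half of the final sum
                low >> self.half,               # carry into the high half
                (x >> self.half) & self.half_mask,
                (y >> self.half) & self.half_mask,
            )
        return result
```

In this model each stage handles only half of the carry chain, which corresponds to the shorter cycle time, while each addition takes two calls to `clock` to produce its result. Because a new pair of operands can enter stage 1 while stage 2 finishes the previous pair, independent additions still complete at a rate of one per clock cycle.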
However, because more than one clock cycle is required to complete arithmetic operations, a mechanism is ordinarily required to prevent erroneous use of incomplete or incorrect ALU results in a subsequent operation. One such mechanism is a hardware interlock that inserts “dead” execution cycles. These dead execution cycles ensure that operations are completed before their results are used for subsequent operations in the ALU or processor (e.g., they allow data to propagate to an appropriate location). Similarly, dead execution cycles can be inserted by software. For example, a software compiler can insert dead execution cycles appropriately as needed.
While dead execution cycles ensure proper operation of a pipelined ALU, they reduce the performance of the processor by limiting the processing of data during each clock cycle. For example, the pipelined ALU can generate a partial result of an operation relatively quickly, but before that partial result can be used in a subsequent operation, the ALU may have to wait while dead execution cycles pass. Thus, during dead execution cycles, the pipelined ALU is idling and not executing any subsequent operations.
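The cost of dead execution cycles can be illustrated with a simplified Python model of an interlock. The model assumes in-order issue of at most one operation per clock cycle and a fixed result latency for every operation; the function name `count_cycles`, the register names, and the program encoding are hypothetical and serve only this example.

```python
def count_cycles(program, latency):
    """Count issue cycles and dead cycles for an in-order sequence of operations.

    program: list of (dest, sources) pairs, where sources is a tuple of names.
    latency: number of clock cycles before a result may be used by a later operation.
    Returns (total_issue_cycles, dead_cycles).
    """
    ready_at = {}   # name -> first cycle in which its value may be read
    cycle = 0       # cycle in which the next operation would issue
    dead = 0
    for dest, sources in program:
        # Names never written by the program are treated as always available.
        earliest = max((ready_at.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            dead += earliest - cycle    # interlock inserts dead cycles
            cycle = earliest
        ready_at[dest] = cycle + latency
        cycle += 1
    return cycle, dead


# A chain in which each operation consumes the previous result.
dependent = [("r1", ("a", "b")), ("r2", ("r1", "c")), ("r3", ("r2", "d"))]
# Three operations with no dependence on one another.
independent = [("r1", ("a", "b")), ("r2", ("c", "d")), ("r3", ("e", "f"))]
```

Under these assumptions, the dependent chain issues in three cycles with a latency of one (the single-cycle case) but needs five cycles, two of them dead, with a latency of two. The independent sequence issues in three cycles in either case, which shows that the penalty arises only when an operation needs a result that is not yet complete.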
In view of the foregoing, it would be desirable to provide an ALU with improved efficiency that reduces, if not eliminates, the need to intentionally stall operations.