The demand for high-performance computers and communication devices requires that state-of-the-art digital signal processors (DSPs) and general-purpose microprocessors, such as x86-based microprocessors, execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.
Superpipelining refers to the simultaneous processing of multiple instructions in the pipeline. For example, if a processor executes each instruction in five stages and each stage requires a single clock cycle to perform its function, then five separate instructions can be processed simultaneously in the pipeline, with the processing of one instruction completed during each clock cycle once the pipeline is full. Hence, the instruction throughput of an N-stage pipelined architecture is, in theory, N times greater than the throughput of a non-pipelined architecture that completes only one instruction every N clock cycles. However, the speed improvements provided by pipeline architectures and superpipelining are ultimately limited by the speed at which the individual stages in the pipeline execute. It is therefore important to minimize the time required to execute each part of an instruction.
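The throughput comparison above can be illustrated with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur.

```python
def cycles_unpipelined(num_instructions, num_stages):
    """Each instruction occupies the whole processor for num_stages cycles."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """Idealized pipeline: the first instruction needs num_stages cycles to
    fill the pipeline; each later instruction completes one cycle after the
    previous one. Assumes no stalls or hazards."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def speedup(num_instructions, num_stages):
    """Ratio of unpipelined to pipelined cycle counts."""
    return (cycles_unpipelined(num_instructions, num_stages)
            / cycles_pipelined(num_instructions, num_stages))


if __name__ == "__main__":
    # Five-stage example from the text, for increasing instruction counts.
    for count in (5, 100, 10000):
        print(count, cycles_unpipelined(count, 5),
              cycles_pipelined(count, 5), round(speedup(count, 5), 3))
```

Under these assumptions the speedup approaches, but never quite reaches, the theoretical factor of N as the number of instructions grows, because the pipeline must first be filled.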
Mathematical operations often incur substantial time delays in calculating a value, particularly additions of two arguments containing a large number of bits. For example, adding two 32-bit numbers or two 64-bit numbers often results in a substantial time penalty. The delay is usually related to the propagation delay of a carry bit throughout the length of the sum. For example, a typical 32-bit adder may have as its inputs a first 32-bit argument, a second 32-bit argument, and a carry-in bit from a previous stage. The 32-bit adder may comprise 32 carry-sum cells, where each cell receives one bit from each argument and a carry bit from a lower order cell and generates a sum bit and a carry bit. For example, the fifth carry-sum cell receives the fifth bit from a first 32-bit argument, the fifth bit from a second 32-bit argument, and the carry bit from the fourth cell. Unfortunately, the output of the 32-bit adder is not correct until the carry bit ripples through all 32 cells. This results in a significant time delay.
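The carry-sum cell arrangement described above can be modeled in software. The following Python sketch is a behavioral illustration only, not a circuit description: the names full_adder and ripple_carry_add are hypothetical, and the loop models the sequential dependence of each cell on the carry from the cell below it.

```python
def full_adder(a_bit, b_bit, carry_in):
    """One carry-sum cell: takes one bit from each argument plus the carry
    from the next lower-order cell, returns (sum_bit, carry_out)."""
    sum_bit = a_bit ^ b_bit ^ carry_in
    carry_out = (a_bit & b_bit) | (carry_in & (a_bit ^ b_bit))
    return sum_bit, carry_out


def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit arguments cell by cell, lowest-order bit first.
    Cell i cannot produce a valid result until cell i-1 has produced its
    carry, which is the source of the ripple delay."""
    carry = carry_in
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        sum_bit, carry = full_adder(a_bit, b_bit, carry)
        result |= sum_bit << i
    return result, carry  # (width-bit sum, carry out of the top cell)


if __name__ == "__main__":
    # Worst case for the ripple: the carry generated in the lowest-order
    # cell must propagate through every one of the 32 cells.
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

In the worst-case call shown, a carry generated in the lowest-order cell passes through all 32 cells before the final sum and carry-out are valid, which corresponds to the delay described above.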
Therefore, there is a need in the art for improved data processors that maximize instruction throughput. In particular, there is a need in the art for improved adder circuits capable of rapidly adding two arguments containing a large number of bits. More particularly, there is a need for improved adder circuits that minimize the carry ripple delay caused by adding two large arguments.