1. Field of the Invention
This invention relates generally to data processors, and more particularly to execution units within data processors.
2. Description of Related Art
Most microprocessors used today employ 32-bit data paths, and a trend for microprocessor design has been to increase data path widths to increase processing power. Increasing the width of data paths allows more bits of data to be processed per cycle. However, with conventional microprocessor architectures, simply increasing the width of data paths increases the data element size, for example, from 32-bit to 64-bit, and may improve accuracy in certain calculations, but does not increase the processing rate for data elements. Increasing the width of the data path also requires larger registers and more complex arithmetic logic units (ALUs), thereby increasing the size and complexity of the microprocessor.
Usable processing power may be increased by increasing the number of types of data operations implemented. Some basic operations, such as integer add/subtract, can be implemented with relatively simple circuits, which are low cost and small. However, more complicated operations, such as floating point multiply, require more complex and larger circuits. To maximize performance, circuits are often designed specifically to implement individual operations, which increases the number of processing circuits in a microprocessor. Consequently, microprocessor chips are expanding in size and cost to support more operations and more complex operations.
Minimizing cost and chip area are important goals in microprocessor design. Therefore, an execution unit data path which processes large data streams with multiple data elements and allows complex operations to be performed while reducing both chip size and cost is desired.
In accordance with an aspect of the invention, a processor provides a data path wherein the input data stream is divided into smaller data “slices”. Processing occurs on each data slice in parallel with the other slices, thereby allowing larger data widths and multiple data elements to be processed in less time. In one embodiment, a 288-bit data stream is divided into eight 36-bit slices. Each of the 36-bit slices can support four 8-bit, four 9-bit, two 16-bit, or one 32-bit data elements. The 288-bit data stream is processed by performing parallel operations on the eight 36-bit slices. In a similar manner, any large data stream can be handled by processors containing smaller functional and arithmetic units by adding smaller width data paths in parallel.
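The slicing described above can be illustrated with a minimal software sketch. It is not the circuit of the embodiment; it only models a 288-bit value as eight 36-bit slices and applies an element-wise add to each slice independently. The function names, the ordering of slices (slice 0 is taken as the least significant), the choice of four 9-bit elements per slice, and the wrap-around behavior on overflow are all assumptions made for illustration.

```python
SLICE_BITS = 36                      # slice width from the described embodiment
NUM_SLICES = 8                       # 8 x 36 = 288 bits
SLICE_MASK = (1 << SLICE_BITS) - 1
ELEM_BITS = 9                        # assumed: four 9-bit elements per slice
ELEM_MASK = (1 << ELEM_BITS) - 1


def split_slices(stream):
    """Divide a 288-bit integer into eight 36-bit slices (slice 0 = low bits, assumed)."""
    return [(stream >> (i * SLICE_BITS)) & SLICE_MASK for i in range(NUM_SLICES)]


def join_slices(slices):
    """Reassemble eight 36-bit slices into one 288-bit integer."""
    out = 0
    for i, s in enumerate(slices):
        out |= (s & SLICE_MASK) << (i * SLICE_BITS)
    return out


def add_elements(x, y):
    """Add the four 9-bit elements of two slices; each element wraps on overflow (assumed)."""
    result = 0
    for k in range(SLICE_BITS // ELEM_BITS):
        ex = (x >> (ELEM_BITS * k)) & ELEM_MASK
        ey = (y >> (ELEM_BITS * k)) & ELEM_MASK
        result |= ((ex + ey) & ELEM_MASK) << (ELEM_BITS * k)
    return result


def sliced_add(a, b):
    """Process every slice independently, as parallel hardware slices would."""
    pairs = zip(split_slices(a), split_slices(b))
    return join_slices([add_elements(x, y) for x, y in pairs])
```

Because no carry crosses an element or slice boundary in this sketch, each call to `add_elements` depends only on its own slice, which is the property that lets the slices operate in parallel.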
According to another aspect of the invention, a processor execution unit is divided into smaller functional units, and instead of executing each complicated multi-cycle instruction with a single complex circuit, the smaller functional units are chained together as required to execute complicated instructions. The smaller functional units, which can be single-cycle units, can be shared or used for a number of different types of instructions so that the total amount of processing circuitry is reduced. One embodiment performs 36-bit integer multiply, integer multiply-and-accumulate (MAC), floating point add/subtract, and floating point multiply operations using a combination of single-cycle multipliers, arithmetic logic units (ALUs), and accumulators.
For integer multiply, a first functional unit is a 32-bit multiplier that generates a 64-bit partial carry and a 64-bit partial sum in the first clock cycle. In the second clock cycle, a 36-bit adder contained in a first ALU adds the 32 low bits of the partial carry and sum, and a 36-bit adder in a second ALU adds the 32 high bits of the partial carry and sum. The second ALU also adds a possible incoming carry bit from the first ALU when the adders add data types with widths greater than 36 bits. The output of the two ALUs can be stored in an accumulator or in a register file as the product of two integers.
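The two-cycle integer multiply can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not the embodiment's circuit: operands are treated as unsigned 32-bit values, the multiplier is modeled by a generic carry-save reduction of partial products, and the adders are modeled as 32 bits wide rather than 36. All function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def carry_save_multiply(a, b):
    """Cycle 1: reduce the partial products to a 64-bit (sum, carry) pair.

    The pair satisfies (sum + carry) mod 2**64 == a * b for unsigned 32-bit a, b.
    """
    rows = [(a << i) & MASK64 for i in range(32) if (b >> i) & 1]
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        rows.append(x ^ y ^ z)                                        # bitwise sum
        rows.append((((x & y) | (x & z) | (y & z)) << 1) & MASK64)    # shifted carries
    rows += [0, 0]                    # pad when fewer than two rows were produced
    return rows[0], rows[1]


def add_halves(s, c):
    """Cycle 2: one adder handles the low 32 bits, a second handles the high 32 bits."""
    low = (s & MASK32) + (c & MASK32)
    carry = low >> 32                 # carry passed from the first adder to the second
    high = (s >> 32) + (c >> 32) + carry
    return ((high & MASK32) << 32) | (low & MASK32)


def int_multiply(a, b):
    """Chain the multiplier stage and the two-adder stage."""
    s, c = carry_save_multiply(a & MASK32, b & MASK32)
    return add_halves(s, c)
```

For example, `int_multiply(12345, 6789)` returns the same value as `12345 * 6789`; the point of the sketch is that the final carry-propagating addition is deferred to the second stage and split between two narrower adders.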
The operations for integer MAC are the same as for integer multiply, except that in the first clock cycle, a value in the accumulator which is to be added is transferred to the two ALUs. In the second clock cycle, the first and second ALUs then add the accumulator bits as well as the partial sum and carry bits to provide a result to the accumulator or register file. Therefore, both integer multiply and integer MAC are executed in two clock cycles, sharing a multiplier, two ALUs, and an accumulator.
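A minimal sketch of the multiply-and-accumulate flow follows. It assumes unsigned 32-bit operands and a 64-bit accumulator that wraps on overflow, neither of which is stated in the text. The hypothetical `multiplier_stage` is only a stand-in for the multiplier: it returns some pair whose sum equals the product, not a true carry-save output.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def multiplier_stage(a, b):
    """Cycle 1 stand-in: return a (sum, carry) pair with sum + carry == a * b.

    The product is split over the even and odd bit positions of b. This is an
    illustrative substitute, not the carry-save form a hardware multiplier produces.
    """
    s = a * (b & 0x55555555)
    c = a * (b & 0xAAAAAAAA)
    return s, c


def mac(acc, a, b):
    """Return (acc + a * b) mod 2**64 using two half-width three-operand adds."""
    s, c = multiplier_stage(a & MASK32, b & MASK32)
    acc &= MASK64
    # Cycle 2, first ALU: low halves of sum, carry, and accumulator.
    low = (s & MASK32) + (c & MASK32) + (acc & MASK32)
    carry = low >> 32                 # may be 0, 1, or 2 with three operands
    # Cycle 2, second ALU: high halves plus the carry from the first ALU.
    high = (s >> 32) + (c >> 32) + (acc >> 32) + carry
    return ((high & MASK32) << 32) | (low & MASK32)
```

The only difference from a plain multiply in this sketch is the third operand fed to each adder, which mirrors the statement that multiply and MAC share the same units.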
Similarly, floating point add/subtract and floating point multiply operations can be simplified using the same multipliers, ALUs, and accumulators as for the integer operations. For floating point add/subtract, the first ALU, in a first clock cycle, aligns the exponents by determining the difference between the exponents and right shifting the mantissa of the operand with the smaller exponent by an amount equal to the exponent difference. The common exponent is the larger of the two operands' exponents. Also in the first cycle, the first ALU adds the mantissas and transfers the result to the second ALU if the shift amount was one or less. Otherwise, the aligned operands are transferred directly to the second ALU.
In the second clock cycle, the second ALU adds the mantissas of the aligned operands if the shift amount was greater than one. The result, either from the first ALU or from the adder in the second ALU, is normalized either by right shifting the mantissa and incrementing the exponent if overflow occurs, or by left shifting the mantissa by the number of leading zeros and subtracting that shift amount from the common exponent. Floating point add/subtract is completed in the second clock cycle after the result is rounded in the second ALU, according to one of four rounding modes existing in this embodiment. The result is then transferred and stored.
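The align, add, and normalize steps can be sketched as follows. The representation is an assumption made for illustration: each operand is a hypothetical `(sign, exponent, mantissa)` triple whose value is `(-1)**sign * mantissa * 2**exponent`, with a 24-bit normalized mantissa. Rounding is reduced to truncation of shifted-out bits, which is a simplification and not one of the embodiment's four rounding modes as such. The division of the addition between the two ALUs according to the shift amount is not modeled.

```python
MANT_BITS = 24  # assumed mantissa width, including the leading one


def fp_add(a, b):
    """Add two (sign, exponent, mantissa) triples; subtraction is addition of a negated operand."""
    (sa, ea, ma), (sb, eb, mb) = a, b

    # Cycle 1: order the operands so that `a` has the larger magnitude,
    # then align by right shifting the mantissa with the smaller exponent.
    if ea < eb or (ea == eb and ma < mb):
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    shift = ea - eb
    mb >>= shift                      # bits shifted out are dropped (truncation, assumed)
    exp = ea                          # common exponent is the larger exponent

    # Mantissa add (same signs) or subtract (different signs).
    mant = ma + mb if sa == sb else ma - mb
    if mant == 0:
        return (0, 0, 0)

    # Cycle 2: normalize.
    if mant >> MANT_BITS:             # overflow: right shift by one, increment exponent
        mant >>= 1
        exp += 1
    else:                             # left shift by the number of leading zeros
        leading_zeros = MANT_BITS - mant.bit_length()
        mant <<= leading_zeros
        exp -= leading_zeros
    return (sa, exp, mant)
```

For instance, adding `(0, 0, 0x800000)` to itself overflows the 24-bit mantissa, so the result is renormalized to `(0, 1, 0x800000)`, while subtracting `(0, 0, 0x800000)` from `(0, 0, 0xC00000)` leaves one leading zero and yields `(0, -1, 0x800000)`.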
Whereas the above three multi-cycle instructions required two clock cycles to complete, floating point multiply requires three clock cycles. The same multiplier as above generates a carry and a sum from the mantissas of the two operands in the first clock cycle. In the second clock cycle, the first ALU adds the most significant bits of the carry and sum and also adds the two exponents of the operands. In the third clock cycle, the second ALU normalizes and rounds the result and then transfers the final result.
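The three-cycle multiply can be sketched in the same style. As before, the `(sign, exponent, mantissa)` triple with value `(-1)**sign * mantissa * 2**exponent`, the 24-bit normalized mantissa, and rounding by truncation are assumptions for illustration. The multiplier's separate sum and carry outputs are collapsed into a single Python multiplication, so the addition of their most significant bits in the first ALU is not modeled separately.

```python
MANT_BITS = 24  # assumed mantissa width, including the leading one


def fp_multiply(a, b):
    """Multiply two (sign, exponent, mantissa) triples with normalized 24-bit mantissas."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    if ma == 0 or mb == 0:
        return (0, 0, 0)

    # Cycle 1: the multiplier forms the product of the two mantissas
    # (in hardware this is held as a carry and a sum).
    product = ma * mb

    # Cycle 2: the exponents are added; the sign is the XOR of the operand signs.
    exp = ea + eb
    sign = sa ^ sb

    # Cycle 3: normalize back to MANT_BITS bits and round (here: truncate).
    shift = product.bit_length() - MANT_BITS   # 23 or 24 for normalized inputs
    mant = product >> shift
    exp += shift
    return (sign, exp, mant)
```

With these conventions, `fp_multiply((0, 0, 0xC00000), (0, 0, 0xC00000))` gives `(0, 24, 0x900000)`, since 1.5 times 1.5 is 2.25 and the product needs one extra normalization shift.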
Consequently, a microprocessor that might have required four large and expensive circuits to execute the above-mentioned multi-cycle instructions can now execute the same instructions by employing a single-cycle multiplier, two single-cycle ALUs, and an accumulator according to one embodiment of the invention. Because these single-cycle units are typically smaller and less expensive and because many functions can share various single-cycle units, the size and cost of a processor are reduced. Aspects of the invention can be applied to other instructions, data types, data widths, and data formats, and therefore these descriptions are not meant to be limiting.