1. Field of the Invention
The present invention relates to the field of hardware used for implementing arithmetic operations such as processor instructions. More specifically, the present invention relates to a multiplier circuit capable of operating on operands of various data types, including both signed and unsigned binary values.
2. Related Art
Hardware multipliers are an indispensable component of every computer system, cellular phone and most digital audio/video equipment. In real-time applications (e.g., flight simulators, speech recognition, video teleconferencing, computer games, streaming audio/video etc.), the overall system performance is heavily dependent on the speed of the internal multipliers. For instance, processing digital images at 30 frames/second requires nearly 2.2 million multiply operations per second. Therefore, designing fast multipliers that occupy smaller areas on the integrated circuit (IC) chip and that consume less power is essential to a successful product.
In multimedia applications, multipliers are used to perform a wide range of functions such as Inverse Discrete Cosine Transforms (IDCT), Fast Fourier Transforms (FFT), and Multiply Accumulate (MAC) operations on 8-bit, 16-bit, and 32-bit signed and unsigned operands. It would be advantageous to provide a multiplier device which can support a variety of data formats. One effort to produce a multiplier that can support a variety of data formats resulted in multi-cycle multipliers.
FIG. 1 illustrates the operation 10 of a multi-cycle multiplier of the prior art. In the multi-cycle multiplier, a smaller multiplier circuit (e.g., 8×8 bit) is used to compute partial products (e.g., step 12) which are accumulated together (e.g., step 14) to form the final result. The multi-cycle or “iterative” method uses a basic multiplier to perform the multiplication for larger word lengths. This method does not allow high throughput for large word lengths, and although it may result in a shorter delay for 8-bit operations, the extra cycles to perform 16-bit and 32-bit operations result in serious side effects such as longer delay, more wiring, bypassing, and unwanted stalls in the pipeline. Table I shows the number of clock cycles needed for partial product reduction using a typical 8×8 bit multiplier circuit for performing 8-bit, 16-bit, and 32-bit multiplications.
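The iterative method described above can be illustrated with a minimal software sketch. The function names `mul8x8` and `iterative_multiply` are hypothetical and chosen only for this illustration. The sketch counts how many 8×8 bit partial products are needed for each operand width; it does not model the clock cycle counts of Table I, since a given prior art circuit may compute and accumulate partial products on a different schedule.

```python
def mul8x8(a, b):
    """Stand-in for the small hardware multiplier: 8-bit x 8-bit -> 16-bit."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def iterative_multiply(x, y, width):
    """Unsigned width x width multiply built only from 8x8 partial products.

    Returns (product, number_of_8x8_multiplies). This is an illustrative
    model of the iterative approach, not a description of any one circuit.
    """
    assert width % 8 == 0
    chunks = width // 8
    acc = 0
    count = 0
    for i in range(chunks):
        xi = (x >> (8 * i)) & 0xFF          # i-th byte of the multiplicand
        for j in range(chunks):
            yj = (y >> (8 * j)) & 0xFF      # j-th byte of the multiplier
            # Each 8x8 partial product is weighted by its byte position
            # and accumulated into the running result.
            acc += mul8x8(xi, yj) << (8 * (i + j))
            count += 1
    return acc, count


# The number of 8x8 partial products grows quadratically with operand width.
for width in (8, 16, 32):
    operand = (1 << width) - 1
    product, count = iterative_multiply(operand, operand, width)
    assert product == operand * operand
    print(width, count)
```

Because the number of small multiplications grows as the square of the number of bytes per operand, wider operands need more passes through the same 8×8 bit hardware, which is the source of the differing delays noted above.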
As discussed above, there are numerous disadvantages with the prior art multi-cycle multiplier approach, such as larger cycle latency, smaller throughput, and, perhaps worst of all, different timing delays for different data formats, which create stalls in the pipeline when dealing with wider numbers.
Recently, a matrix vector multiplier (MVM) dedicated to video decoding and 3-D computer graphics was proposed in a reference entitled, “Matrix Vector Multiplier (MVM) Dedicated to Video Decoding and 3-D Computer Graphics,” by Hideyuki et al., IEEE Transactions on Circuits and Systems for Video Technology, Volume 9, No. 2, March 1999, pages 306-314. This multiplier supports multiple operations on 16-bit and 32-bit unsigned operands using only one multiplier, at the cost of a very low clock speed of 20 MHz. Like other multipliers using the iterative method, many extra cycles are required to perform the 32-bit multiply operations, which reduces the overall performance of this device. It would be advantageous to provide a multiplier circuit design that could support a variety of data formats (e.g., lengths) without consuming extra cycles for multiply operations on larger operands.
An Intel design is described in a reference entitled, “A 600 MHz IA-32 Microprocessor with Enhanced Data Streaming for Graphics and Video,” by Stephen Fischer, Digest of Technical Papers, ISSCC 1999, pages 98-450. In this design approach, two separate hardware multipliers are used to perform two 16×16 bit multiplications. Since these multipliers are not partitioned, this approach does not allow the flexibility to use these multipliers for a variety of data formats, and the duplication of circuitry consumes large amounts of area and power. Moreover, extra cycles are required to perform 32-bit operations because the iterative method is required for operands larger than 16 bits. Lastly, this design does not allow much parallelism for 8-bit operations.
The second prior art method for performing multiplication that supports a variety of data formats uses separate hardware for different data types. For instance, a separate 32×32 bit multiplier circuit, a separate 16×16 bit multiplier circuit and a separate 8×8 bit multiplier circuit are included within a single multiplier device. However, using separate hardware for different data types can become extremely costly because it requires large amounts of chip area and consumes more power.
An AltiVec design by Motorola is described in a paper entitled, “A Low Power, High Speed Implementation of a PowerPC Microprocessor Vector Extension,” by Martin S. Schmookler et al., presented at the 14th IEEE Symposium on Computer Arithmetic, 1999. This is the first architecture which supports multiplication on 8-bit and 16-bit signed and unsigned operands. However, like the Intel design described above, this prior art design uses redundant/separate hardware for performing 8-bit and 16-bit multiplications. It would be advantageous to provide a multiplier circuit design that could support a variety of data formats without consuming large amounts of area and power.
Accordingly, the present invention provides a multiplier design that accepts a large variety of data formats but does not require iterative steps (e.g., multi-cycling) to perform large operand multiplication, thereby providing very fast operational performance. The present invention advantageously provides constant cycle latency for any operand size (8-bit, 16-bit or 32-bit) and does not perform multiplier multi-cycling for larger operands. Further, the present invention provides a multiplier design that accepts a large variety of data formats but does not utilize multiplier circuitry duplication, thereby providing a hardware efficient and energy efficient device.
A partitioned multiplier circuit is described herein which is designed for high speed operations. The multiplier of the present invention can perform one 32×32 bit multiplication, two 16×16 bit multiplications (simultaneously) or four 8×8 bit multiplications (simultaneously) depending on input partitioning signals. The time required to perform either the 32×32 bit or the 16×16 bit or the 8×8 bit multiplications is the same due to the design of the present invention. Multiplication results are available with a constant latency (e.g., two clock cycles in one embodiment) regardless of the operand bit-size. In the embodiment that requires two clock cycle latency, the multiplier circuit has a throughput of one clock cycle due to pipelining. The input operands can be signed or unsigned. The hardware is partitioned without any significant increase in the delay or area and the multiplier can provide six different modes of operation. In one embodiment, Booth encoding is used for the generation of 17 partial products which are compressed using a compression tree into two 64-bit values. This is performed in the first pipeline stage to generate a 64-bit sum vector and a 64-bit carry vector. These values are then added, in the second pipestage, using a carry propagate adder circuit to provide a single 64-bit result. In the case of 16×16 bit multiplication, the 64-bit result contains two 32-bit results. In the case of 8×8 bit multiplication, the 64-bit result contains four 16-bit results. Due to its high operating speed, the multiplier circuit is advantageous for use in multi-media applications, such as audio/visual rendering and playback.
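The count of 17 partial products is consistent with radix-4 Booth recoding of a 32-bit unsigned multiplier, which can be sketched in software as follows. The text above does not state which Booth radix is used, so radix-4 is an assumption here, and the function names `booth_radix4_digits` and `booth_partial_products` are hypothetical. The sketch covers only the unpartitioned arithmetic, not the partitioning logic of the circuit.

```python
def booth_radix4_digits(y, width):
    """Radix-4 Booth digits (each in -2..2) of an unsigned `width`-bit value.

    An unsigned operand is treated as zero-extended so the top digit is
    non-negative, which gives width // 2 + 1 digits (17 for 32 bits).
    """
    assert width % 2 == 0 and 0 <= y < (1 << width)
    padded = y << 1                      # implicit bit y[-1] = 0 below bit 0
    digits = []
    for i in range(width // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111   # y[2i+1], y[2i], y[2i-1]
        b_low = triplet & 1
        b_mid = (triplet >> 1) & 1
        b_high = (triplet >> 2) & 1
        digits.append(b_low + b_mid - 2 * b_high)
    return digits


def booth_partial_products(x, y, width):
    """One partial product per Booth digit: digit * x, weighted by 4**i."""
    digits = booth_radix4_digits(y, width)
    return [(d * x) << (2 * i) for i, d in enumerate(digits)]


x, y = 0xDEADBEEF, 0xFFFFFFFF
partials = booth_partial_products(x, y, 32)
assert len(partials) == 17
assert sum(partials) == x * y
```

Each digit selects 0, ±x or ±2x, so every partial product is obtainable by shifting and optionally negating the multiplicand, which roughly halves the number of rows compared with one partial product per multiplier bit.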
More specifically, an embodiment of the present invention includes a partitioned multiplier comprising: a sign extension and partitioning circuit receiving a 32-bit multiplicand and producing a 64-bit extended multiplicand; a Booth encoder and selector circuit receiving the 64-bit extended multiplicand and receiving a 32-bit multiplier, the Booth encoder and selector circuit simultaneously generating 17 partial products properly partitioned for performing byte, half-word (16-bit) and word (32-bit) multiply operations based on a partition signal, wherein partial products 6-17 are zero for the byte multiply operations and wherein partial products 10-17 are zero for the half-word multiply operations; a compressor tree receiving the 17 partial products and generating therefrom a sum vector and a carry vector; and an adder circuit adding the sum and the carry vectors and producing a 64-bit output, wherein the 64-bit output is generated with two cycle latency and single cycle throughput for each of the byte, half-word and word multiply operations.
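The roles of the compressor tree and the final adder can be illustrated with a small carry-save model. The sketch below assumes 3:2 compressors (bitwise full adders); the structure of the actual compressor tree is not specified in the text above, and hardware designs may use other compressor types. The names `compress_3_to_2` and `reduce_to_sum_and_carry` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def compress_3_to_2(a, b, c):
    """Bitwise full adders: three 64-bit rows in, (sum, carry) rows out."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move up one bit
    return s & MASK64, carry & MASK64


def reduce_to_sum_and_carry(rows):
    """Reduce any number of rows to a sum vector and a carry vector.

    No carry propagates along a row during reduction; the single
    carry-propagate addition is deferred to the final step.
    """
    rows = [r & MASK64 for r in rows]    # two's complement, modulo 2**64
    while len(rows) > 2:
        reduced = []
        for k in range(0, len(rows) - 2, 3):
            reduced.extend(compress_3_to_2(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that do not fill a group of three pass through unchanged.
        reduced.extend(rows[len(rows) - len(rows) % 3:])
        rows = reduced
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


# Seventeen arbitrary rows, some negative, standing in for partial products.
rows = [((-1) ** i) * (0x9E3779B9 + i) << (2 * i) for i in range(17)]
sum_vec, carry_vec = reduce_to_sum_and_carry(rows)
# Final stage: one carry-propagate addition yields the 64-bit result.
assert (sum_vec + carry_vec) & MASK64 == sum(rows) & MASK64
```

Deferring carry propagation to a single final addition is what allows the reduction of many rows to fit in one pipeline stage, with the carry propagate adder occupying the next.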
Embodiments include the above and wherein the multiplicand and the multiplier each comprise four 8-bit operands, the multiplier simultaneously performing four 8×8 bit multiply operations and wherein the 2n-bit output comprises four 16-bit results. Embodiments include the above and wherein the multiplicand and the multiplier each comprise two 16-bit operands, the multiplier simultaneously performing two 16×16 bit multiply operations and wherein the 2n-bit output comprises two 32-bit results. Embodiments include the above and wherein the multiplicand and the multiplier each comprise one 32-bit operand, the partitioned multiplier performing one 32×32 bit multiply operation and wherein the 64-bit output comprises one 64-bit result.
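The input and output formats of the three partitioning cases can be summarized with a behavioral reference model. This is a sketch of the expected results only, not of the circuit. The function names are hypothetical, and the placement of each lane's result in the 64-bit output (lane k occupying the k-th double-width field, counting from the least significant end) is an assumption, since the text above does not specify the ordering.

```python
def to_signed(value, bits):
    """Interpret a `bits`-wide field as a two's-complement integer."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def partitioned_multiply(a, b, lane_bits, signed=False):
    """Multiply 32-bit packed operands lane by lane into a 64-bit result.

    lane_bits = 8  -> four 8x8 products, each 16 bits wide
    lane_bits = 16 -> two 16x16 products, each 32 bits wide
    lane_bits = 32 -> one 32x32 product, 64 bits wide
    """
    assert lane_bits in (8, 16, 32)
    lane_mask = (1 << lane_bits) - 1
    out_bits = 2 * lane_bits
    out_mask = (1 << out_bits) - 1
    result = 0
    for lane in range(32 // lane_bits):
        x = (a >> (lane * lane_bits)) & lane_mask
        y = (b >> (lane * lane_bits)) & lane_mask
        if signed:
            x, y = to_signed(x, lane_bits), to_signed(y, lane_bits)
        # Assumed layout: lane k fills the k-th double-width output field.
        result |= ((x * y) & out_mask) << (lane * out_bits)
    return result


# Four byte lanes: 1*5, 2*6, 3*7 and 4*8, each in its own 16-bit field.
assert partitioned_multiply(0x01020304, 0x05060708, 8) == 0x0005000C00150020
# Two signed half-word lanes: (-1)*2 and 3*4.
assert partitioned_multiply(0xFFFF0003, 0x00020004, 16, signed=True) == 0xFFFFFFFE0000000C
```

Three lane widths combined with the signed and unsigned interpretations give six combinations, which may correspond to the six modes of operation mentioned above, although the modes are not enumerated in this section.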