Forward discrete cosine transforms ("DCT operations" or "DCT transforms") are a well known class of discrete time to frequency domain transforms. Inverse discrete cosine transforms ("IDCT operations" or "IDCT transforms") are a well known class of discrete frequency to time domain transforms. DCT and IDCT operations (sometimes referred to herein collectively as "discrete transform" operations or "discrete transforms") are employed to transform input data signals in many applications, with their specifications set by international standards bodies.
In voice and image compression systems, input signals are often transformed by DCT circuitry because the DCT transform is very well suited for decorrelating real-valued signals and concentrating their information content in low frequency components.
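The energy-concentration property described above can be illustrated with a minimal sketch. The orthonormal DCT-II scaling and the eight-sample ramp are assumptions chosen for illustration only; they are not taken from any of the standards or circuits discussed herein.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a list of real samples (direct O(N^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

# Hypothetical, strongly correlated input: a slow ramp of eight samples.
samples = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
coeffs = dct_ii(samples)

# With orthonormal scaling, energy is preserved, so fractions are meaningful.
total_energy = sum(c * c for c in coeffs)
low_energy = sum(c * c for c in coeffs[:2])   # two lowest-frequency terms
low_fraction = low_energy / total_energy
print(f"fraction of energy in the two lowest coefficients: {low_fraction:.4f}")
```

For this ramp, nearly all of the signal energy falls in the two lowest-frequency coefficients, which is why the remaining coefficients can be coarsely quantized in a compression system.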
DCT and IDCT operations are used in video conferencing in accordance with the standard established by CCITT Recommendation H.261. DCT and IDCT operations are also used for still image transmission in accordance with the JPEG standard set by the International Organization for Standardization (ISO), and for transmission of moving images in accordance with the MPEG standard, also set by ISO.
All of the application areas mentioned above require calculation of a DCT as well as its inverse transform, the IDCT. In many cases, however, data are either being compressed (requiring the DCT) or being decompressed (requiring the IDCT), but not both at the same time. To reduce hardware complexity, an implementation should therefore be designed so that it can be programmed to execute either the DCT or the IDCT. For image processing, a further requirement is that the implementation be able to execute a two-dimensional DCT or IDCT. This is typically accomplished by carrying out two sets of one-dimensional DCTs (or IDCTs) in sequence. Where one DCT hardware unit is time-shared for both sets of one-dimensional transforms, the hardware should therefore require a minimum number of wait cycles between the first and second sets of one-dimensional transformations.
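The row-column approach to the two-dimensional transform can be sketched as follows. This is a minimal software illustration under assumed orthonormal scaling; the function names `dct_1d`, `transpose`, and `dct_2d` are hypothetical and do not describe any particular hardware.

```python
import math

def dct_1d(x):
    """Orthonormal one-dimensional DCT-II."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i in range(n)))
    return out

def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    """Two-dimensional DCT as two sets of one-dimensional DCTs in sequence."""
    rows_done = [dct_1d(row) for row in block]                  # first set
    cols_done = [dct_1d(col) for col in transpose(rows_done)]   # second set
    return transpose(cols_done)

# A constant 4x4 block has all of its energy in the single DC coefficient.
flat_block = [[1.0] * 4 for _ in range(4)]
result = dct_2d(flat_block)
print(round(result[0][0], 6))
```

Each column of the intermediate result depends on every row transform, so the second set cannot begin on a column until the first set has produced that column's entries. In time-shared hardware this dependency is the source of the wait cycles mentioned above.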
The computation of DCT and IDCT transforms is well known in the art. In "A Fast Recursive Algorithm For Computing The Discrete Cosine Transform" by Hsieh S. Hou, published in the IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 10, October, 1987 (pp. 1455-61), it was proposed that a recursive algorithm be used for computing a fast discrete cosine transform (either a fast forward discrete cosine transform or a fast inverse discrete cosine transform). One embodiment, shown in FIGS. 1 and 2 on page 1459 of the Hou article, employs a basic two point forward discrete cosine transform processing element. Hou does not disclose a universal processing element, in the sense of a single processing element used for all calculations of a forward discrete cosine transform. Because the processing element disclosed in Hou is not a uniform processing element, the apparatus disclosed in Hou cannot be modularly expanded to process input data signals of higher order.
U.S. patent application Ser. No. 720,202, filed Jun. 24, 1991, and U.S. patent application Ser. No. 07/847,195, now abandoned, filed Mar. 6, 1992, both entitled "Method and Apparatus to Transform Time to Frequency and Frequency to Time of Data Signals" (and assigned to the assignee of the present application) also disclose methods and apparatus for performing N-point DCT and IDCT operations. However, these methods and apparatuses require multiple processing iterations to perform each DCT or IDCT transform, and thus may be unsuitable for achieving high transform rates.
Another known technique for performing a DCT transform is to perform full matrix multiplication, using a multiply-accumulate unit for each DCT coefficient which needs to be computed. For applications that tolerate a lower computation rate, solutions have been proposed which use bit-serial arithmetic or distributed arithmetic, or which time-share the multiply-accumulate units.
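The full matrix multiplication approach can be sketched in software as one multiply-accumulate loop per output coefficient. The eight-point size, the orthonormal scaling, and the names `COEFF` and `dct_matrix_multiply` are assumptions made for this illustration.

```python
import math

N = 8  # assumed transform length

# Fixed coefficient matrix: row k holds the weights for output coefficient k.
COEFF = [
    [(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
     * math.cos((2 * n + 1) * k * math.pi / (2 * N))
     for n in range(N)]
    for k in range(N)
]

def dct_matrix_multiply(x):
    """Compute each coefficient with its own multiply-accumulate loop.

    Returns the coefficients and the number of multiply-accumulate steps.
    """
    out = []
    mac_count = 0
    for k in range(N):            # one accumulator per output coefficient
        acc = 0.0
        for n in range(N):
            acc += COEFF[k][n] * x[n]
            mac_count += 1
        out.append(acc)
    return out, mac_count

coeffs, macs = dct_matrix_multiply([1.0] * N)
print(macs)  # 64 multiply-accumulate steps for an eight-point transform
```

The operation count grows as the square of the transform length, which is the cost that the decomposed (FFT-like) flows discussed next are meant to reduce.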
Another known technique for performing a DCT (or an IDCT) transform employs an integrated circuit designed by mapping a Fast Fourier Transform-like (FFT-like) decomposed flow of a DCT (or IDCT) transform directly into silicon. FIG. 1 is a simplified block diagram of such a conventional circuit for implementing a DCT transform. In response to each set of eight parallel input data values (x_0 to x_7), the FIG. 1 circuit outputs eight parallel data values (z_0 to z_7) representing a DCT transform of the input data. The circuit of FIG. 1 has two functional blocks: DCT shuffle-exchange processor 1 consisting of twelve identical butterfly circuits (or "units") 6; followed by post-processor 2 consisting of five subtraction circuits 4 and a fixed-coefficient multiplication unit (identified by the symbol A). Each butterfly circuit 6 includes an adder circuit, a subtraction circuit, and a fixed-coefficient multiplication unit. Four of the multiplication units are identified by the symbol A, two by the symbol B, two by the symbol C, and one each by the symbols D, E, F, and G. The multiplier circuits identified by symbols A, B, C, D, E, F, and G multiply their input values by fixed coefficients A=cos(π/4), B=cos(π/8), C=-sin(π/8), D=cos(π/16), E=sin(π/16), F=-cos(3π/16), and G=-sin(3π/16), respectively.
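To illustrate how a butterfly built from an adder, a subtractor, and a fixed-coefficient multiplier can serve as the building block of a decomposed DCT flow, the following sketch uses a textbook even/odd recursive decomposition of the unscaled DCT-II. It is not the specific wiring of FIG. 1 and not the Hou algorithm; its multiplier coefficients (of the form 1/(2 cos θ)) differ from the coefficients A through G, its final recombination additions are not modeled as a separate post-processor, and the input length is assumed to be a power of two. The function names are hypothetical.

```python
import math

def dct_direct(x):
    """Unscaled DCT-II by direct summation (reference for comparison)."""
    n = len(x)
    return [sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n)) for k in range(n)]

def butterfly(a, b, coeff):
    """One adder, one subtractor, one fixed multiplier on the difference."""
    return a + b, (a - b) * coeff

def dct_fast(x):
    """Recursive even/odd decomposition; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    sums, diffs = [], []
    for i in range(half):
        coeff = 1.0 / (2.0 * math.cos((2 * i + 1) * math.pi / (2 * n)))
        s, d = butterfly(x[i], x[n - 1 - i], coeff)
        sums.append(s)
        diffs.append(d)
    even = dct_fast(sums)    # yields the even-indexed coefficients
    odd = dct_fast(diffs)    # combined pairwise into the odd-indexed ones
    out = [0.0] * n
    for k in range(half):
        out[2 * k] = even[k]
        out[2 * k + 1] = odd[k] + (odd[k + 1] if k + 1 < half else 0.0)
    return out

def count_butterflies(n):
    """Number of butterfly() calls made by dct_fast for length n."""
    return 0 if n == 1 else n // 2 + 2 * count_butterflies(n // 2)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fast = dct_fast(x)
reference = dct_direct(x)
print(count_butterflies(8))
```

For eight points this particular decomposition also uses twelve butterflies, but that agreement in count should not be read as a claim that its structure matches the shuffle-exchange processor of FIG. 1.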
The FIG. 1 circuit implements every step of the DCT operation using a hardware unit. The hardware units are connected as shown in FIG. 1. If data values x_0 to x_7 are simultaneously asserted bits of an eight-bit parallel word, several such eight-bit words can be simultaneously subjected to a DCT transform by simultaneously applying them to a set of identical FIG. 1 circuits connected in parallel. Equivalently, each data value x_j can be a multi-bit parallel word, with each butterfly unit 6 of FIG. 1 designed to implement bit-parallel arithmetic simultaneously on all bits of each word x_j. With parallel processing of several input words, execution of a new DCT (to transform several words) can be started at each clock cycle, resulting in an extremely high transform rate.
For medium speed applications (requiring lower transform rates), the architecture of FIG. 1 can also be used. If only a lower overall processing rate is required, fewer FIG. 1 circuits can be connected in parallel (e.g., a single FIG. 1 circuit can be employed), each data value x_j can represent fewer parallel bits (e.g., a single bit), and bit-serial (or distributed) arithmetic can be implemented in each addition, subtraction, and multiplication unit of FIG. 1. For example, a single FIG. 1 circuit can be employed with each input line (e.g., the input line labeled x_0 or the input line labeled x_7) sequentially receiving the bits of a different serial input word, and each butterfly unit of shuffle-exchange circuit 1 serially processing the sequentially received bits.
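Bit-serial arithmetic of the kind mentioned above can be modeled in software as a single full adder with a stored carry, consuming one bit of each operand per clock cycle. The sketch below assumes unsigned operands presented least significant bit first; the helper names are hypothetical, and the model does not represent the adder of any particular circuit.

```python
def to_bits(value, width):
    """Unsigned integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """List of bits (least significant first) back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))

def bit_serial_add(a_bits, b_bits):
    """Add two equal-length serial words using one full adder and a carry.

    Each loop iteration models one clock cycle: two input bits arrive,
    one sum bit leaves, and the carry is held for the next cycle.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)   # sum bit for this cycle
        carry = total >> 1      # carry stored for the next cycle
    out.append(carry)           # final carry becomes the top bit
    return out

serial_sum = from_bits(bit_serial_add(to_bits(13, 8), to_bits(200, 8)))
print(serial_sum)  # 213
```

An eight-bit addition takes eight cycles on this single one-bit adder instead of one cycle on an eight-bit parallel adder, which is the trade of speed for hardware that makes the approach suitable for lower transform rates.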
Regardless of the particular design of each addition, subtraction, and multiplication unit thereof, the FIG. 1 design for a DCT processor is very simple and straightforward. FIG. 1 employs no control means, and the lack of a control means allows the DCT data-flow dependence graph to be directly interpreted as a data-flow architecture.
FIG. 2 is a simplified block diagram of a conventional circuit (similar to the FIG. 1 circuit) for implementing an IDCT operation. In response to each set of eight parallel input data values (z_0 to z_7), the FIG. 2 circuit outputs eight parallel data values (x_0 to x_7) representing an IDCT transform of the input data. The circuit of FIG. 2 has two functional blocks: pre-processor 3 (consisting of four subtraction units 4', one addition unit 8' and a fixed-coefficient multiplication unit identified by the symbol A) followed by IDCT shuffle-exchange processor 1' (consisting of twelve identical butterfly units 6'). Each unit 6' includes an addition circuit, a subtraction circuit, and a fixed-coefficient multiplication unit. Five of the multiplication units apply multiplicative coefficient A, two apply multiplicative coefficient B, two apply multiplicative coefficient C, and one each applies multiplicative coefficient D, E, F, and G. The same notation employed in FIG. 1 is employed in FIG. 2.
The basic functional block of the shuffle-exchange unit of FIG. 1 (and the shuffle-exchange unit of FIG. 2, which is identical to that of FIG. 1) is a butterfly unit comprising a subtraction/addition circuit pair connected to a multiplier. In FIG. 2, as in FIG. 1, the multiplicative coefficients C, F, and G are fixed negative values, and the other multiplicative coefficients are fixed positive values. For the reasons explained below, the design of FIGS. 1 and 2 unnecessarily increases the cost and complexity of each multiplier circuit of each butterfly unit of the circuits of FIGS. 1 and 2. In preferred embodiments of the present invention, this disadvantage of the prior art is eliminated because the multipliers of the inventive apparatus apply only positive multiplicative coefficients.
A major problem with designing a single data-flow architecture for performing both DCT and IDCT operations is that the IDCT operational flow closely resembles the DCT flow only when its data flow is drawn in the reverse direction (as can be seen by comparing FIG. 1 to FIG. 2). Thus, a combined DCT and IDCT architecture must be able to modify its data flow significantly when switching from a DCT to an IDCT operation.
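One way to see why the IDCT flow resembles the DCT flow with the direction of data flow reversed is the matrix view: with orthonormal scaling (an assumption of this sketch), the IDCT matrix is the transpose of the DCT matrix, and transposing a linear signal-flow graph corresponds to reversing the direction of its data flow. The following minimal check uses hypothetical helper names and an arbitrary test vector.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k produces output coefficient k."""
    return [
        [(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
         * math.cos((2 * i + 1) * k * math.pi / (2 * n))
         for i in range(n)]
        for k in range(n)
    ]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

C = dct_matrix(8)
x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]   # arbitrary test input

z = mat_vec(C, x)                   # forward transform (DCT)
x_back = mat_vec(transpose(C), z)   # inverse transform (IDCT) = transpose

max_error = max(abs(a - b) for a, b in zip(x, x_back))
print(max_error < 1e-9)
```

The same coefficients are used in both directions, but each output of the inverse gathers from the positions to which the forward transform scattered. This is why a fixed data-flow circuit for one direction cannot simply be reused for the other without changing its interconnection.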