Typically, a discrete cosine transform (DCT) apparatus as shown in FIG. 77 performs a full two-dimensional (2-D) transformation of a block of 8×8 pixels by first performing a 1-D DCT on the rows of the 8×8 pixel block. It then performs another 1-D DCT on the columns of the resulting 8×8 block. Such an apparatus typically consists of an input circuit 1096, an arithmetic circuit 1104, a control circuit 1098, a transpose memory circuit 1090, and an output circuit 1092.
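The row-then-column decomposition described above may be sketched in software as follows. This is an illustrative sketch only and does not model the circuit of FIG. 77. The orthonormal DCT-II scaling and the names dct_1d, transpose and dct_2d are assumptions made for the example, since the text does not specify a normalisation.

```python
import math

def dct_1d(v):
    """Direct O(N^2) DCT-II of a sequence (orthonormal scaling is an assumption)."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    """2-D DCT as two 1-D passes: rows first, then columns."""
    rows_done = [dct_1d(row) for row in block]         # first pass over the rows
    cols_in = transpose(rows_done)                     # columns presented as rows
    cols_done = [dct_1d(col) for col in cols_in]       # second pass over the columns
    return transpose(cols_done)                        # restore row/column orientation
```

Under the assumed scaling, a flat 8×8 block in which every pixel is 128 yields a single non-zero coefficient of 1024 at position (0, 0).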
The input circuit 1096 accepts 8-bit pixels from the 8×8 block. The input circuit 1096 is coupled by intermediate multiplexers 1100, 1102 to the arithmetic circuit 1104. The arithmetic circuit 1104 performs mathematical operations on either a complete row or a complete column of the 8×8 block. The control circuit 1098 controls all the other circuits, and thus implements the DCT algorithm. The output of the arithmetic circuit 1104 is coupled to the transpose memory 1090, register 1095 and output circuit 1092. The transpose memory 1090 is in turn connected to multiplexer 1100, which provides output to the next multiplexer 1102. The multiplexer 1102 also receives input from the register 1094. The transpose memory 1090 accepts 8×8 block data in rows and produces that data in columns. The output circuit 1092 provides the coefficients of the DCT performed on an 8×8 block of pixel data.
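The behaviour of the transpose memory, which accepts block data in rows and produces it in columns, may be modelled by the following sketch. The class name TransposeMemory and its methods write_row and read_column are hypothetical and are introduced only for illustration. The sketch says nothing about how the memory 1090 is addressed or built in hardware.

```python
class TransposeMemory:
    """Hypothetical behavioural model: written row by row, read column by column."""

    def __init__(self, size=8):
        self.size = size
        self.cells = [[0] * size for _ in range(size)]
        self.rows_written = 0

    def write_row(self, values):
        """Store the next row of first-pass results."""
        if len(values) != self.size:
            raise ValueError("row length must equal the block size")
        if self.rows_written >= self.size:
            raise RuntimeError("block already full")
        self.cells[self.rows_written] = list(values)
        self.rows_written += 1

    def read_column(self, index):
        """Return one column once the whole block has been written."""
        if self.rows_written < self.size:
            raise RuntimeError("block not yet complete")
        return [self.cells[r][index] for r in range(self.size)]
```

In such a model, the eight rows produced by the first 1-D pass would be written in turn, and each column read back would then form the input to the second 1-D pass.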
In a typical DCT apparatus, the speed of the arithmetic circuit 1104 largely determines the overall speed of the apparatus, since the arithmetic circuit 1104 is the most complex of the circuits.
The arithmetic circuit 1104 of FIG. 77 is typically implemented by breaking the arithmetic process down into several stages as described hereinafter with reference to FIG. 78. A single circuit is then built that implements each of these stages 1144, 1148, 1152, 1156 using a pool of common resources, such as adders and multipliers. The main disadvantage of such a circuit 1104 is that it is slower than optimal, because a single, common circuit is used to implement the various stages of circuit 1104. The common circuit includes a storage means used to store intermediate results. Since the time allocated for the clock cycle of such a circuit must be greater than or equal to the time of the slowest stage of the circuit, the overall time is potentially longer than the sum of the times of the individual stages.
FIG. 78 depicts a typical arithmetic data path, in accordance with the apparatus of FIG. 77, as part of a DCT with four stages. The drawing does not reflect the actual implementation, but instead reflects the functionality. Each of the four stages 1144, 1148, 1152, and 1156 is implemented using a single, reconfigurable circuit. It is reconfigured on a cycle-by-cycle basis to implement each of the four arithmetic stages 1144, 1148, 1152, and 1156 of the 1-D DCT. In this circuit, each of the four stages 1144, 1148, 1152, and 1156 uses a pool of common resources (e.g. adders and multipliers) and thus minimises hardware.
However, the disadvantage of this circuit is that it is slower than optimal. The four stages 1144, 1148, 1152, and 1156 are each implemented from the same pool of adders and multipliers. The period of the clock is determined by the speed of the slowest stage, which in this example is 20 ns (for block 1144). Adding in the delay (2 ns each) of the input and output multiplexers 1146 and 1154 and the delay (3 ns) of the flip-flop 1150, the total time is 20 + 2 + 2 + 3 = 27 ns. Thus, the shortest clock period at which this DCT implementation can run is 27 ns.
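The timing argument above may be checked with the following short sketch. Only the 20 ns delay of stage 1144, the 2 ns multiplexer delays and the 3 ns flip-flop delay are taken from the text. The delays given for the other three stages are hypothetical placeholders, chosen only to show how four cycles at the common clock period compare with the sum of the individual stage delays.

```python
def clock_period_ns(stage_delays_ns, mux_delay_ns, flop_delay_ns):
    """Clock period of the shared circuit: slowest stage plus input and
    output multiplexer delays plus the flip-flop delay."""
    return max(stage_delays_ns) + 2 * mux_delay_ns + flop_delay_ns

# 20 ns (stage 1144) is from the text; 15, 12 and 10 ns are hypothetical.
stage_delays = [20, 15, 12, 10]

period = clock_period_ns(stage_delays, mux_delay_ns=2, flop_delay_ns=3)
total_for_four_cycles = len(stage_delays) * period   # one cycle per stage
sum_of_stage_delays = sum(stage_delays)

print(period, total_for_four_cycles, sum_of_stage_delays)
```

With these figures the clock period is 27 ns, so one pass through the four stages occupies 108 ns, whereas the assumed stage delays themselves sum to only 57 ns.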
Pipelined DCT implementations are also well known. The drawback of such implementations is that they require large amounts of hardware. Whilst the present invention does not offer the same throughput as a pipelined implementation, it offers an extremely good compromise between performance and size, and good speed advantages over most current DCT implementations.
Therefore, a need clearly exists for an improved DCT/inverse-DCT method and apparatus that is able to overcome one or more disadvantages of conventional techniques. In particular, a need clearly exists for a method and apparatus that is able to reduce the time taken for the main arithmetic circuit in a DCT/inverse-DCT apparatus to calculate required results, thereby improving the overall performance of the DCT or inverse DCT.