It is desirable to compress image signals that are used with computer systems, since an image signal for a single uncompressed high-resolution digitized color image can easily consume several megabytes of memory. Because images tend to have low information content, very good compression ratios are usually possible. This is especially true if, as is often the case for image signals used with computer systems, perfect reproduction is not required. In such instances, the low-frequency components of a frequency-domain image signal are perceptually more important in reproducing the image than the high-frequency components of the frequency-domain image signal. Thus, compression schemes that are applied to the frequency-domain version of an image signal need not waste bits in attempting to represent the relatively less significant high-frequency portions of the image signal.
Accordingly, it is desirable to transform an image signal from the spatial domain (also referred to as the “color domain”) to the frequency domain prior to compressing the image signal. Naturally, an inverse operation is required to transform the image signal from the frequency domain back into the color domain prior to representation on a screen, such as a computer monitor. One type of mathematical transform that is suitable for transforming image signals in this manner is the discrete cosine transform (DCT). A DCT includes a pair of transforms, namely a forward DCT (FDCT), which maps a digitized (color-domain) signal to a frequency-domain signal, and an inverse DCT (IDCT), which maps a frequency-domain signal to a signal in the color domain. The FDCT and IDCT are important steps in several international standards, such as MPEG-1, MPEG-2 (H.262) and MPEG-4, as well as the related ITU-T standards H.261 and H.263.
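By way of illustration only, the forward and inverse transforms may be sketched in one dimension as follows. The sketch assumes the orthonormal DCT-II as the FDCT and its transpose as the IDCT; the function names `fdct_1d` and `idct_1d` and the sample values are hypothetical and do not appear in any of the standards mentioned above, which operate on two-dimensional 8-by-8 blocks.

```python
import math


def _scale(k, n):
    # Orthonormal scaling factor (an assumption of this sketch).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def fdct_1d(samples):
    """Map color-domain samples to frequency-domain coefficients (DCT-II)."""
    n = len(samples)
    return [
        _scale(k, n)
        * sum(
            samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Map frequency-domain coefficients back to color-domain samples."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


# Hypothetical row of eight pixel values.
samples = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = fdct_1d(samples)
restored = idct_1d(coeffs)
```

In this sketch, `coeffs[0]` carries the lowest-frequency (average) content of the row, and `restored` matches `samples` to within floating-point rounding because no coefficients have been discarded or quantized.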
However, although the encoding of an image signal using the FDCT provides a definite advantage in terms of compression, there is an associated penalty in terms of the amount of processing required to decode the image signal when successive images are to be displayed. That is to say, a processing unit designed to decode image signals must be capable of performing the IDCT operation sufficiently quickly to allow full-frame video playback in real time.
By way of example, decoding a 1024-by-1024-pixel color image requires 49,152 IDCTs of size 8-by-8 (namely, 3×(1024×1024)/(8×8)). Furthermore, the calculation of a single 8-by-8 IDCT in a conventional manner requires more than 9,200 multiplications and more than 4,000 additions. Thus, if 30 images are to be displayed in each second, as is commonly suggested for full-motion video, then the total number of multiplications per second rises to over 13 billion and the total number of additions per second reaches more than 5 billion. Such tremendous processing requirements heavily influence the design of the processing unit hardware and software.
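The arithmetic behind these figures can be reproduced with a short calculation. The sketch below uses the lower bounds quoted above (9,200 multiplications and 4,000 additions per 8-by-8 IDCT); the constant names are hypothetical, and the factor of 3 is assumed to correspond to three color components per pixel.

```python
WIDTH = 1024
HEIGHT = 1024
COLOR_COMPONENTS = 3      # assumed meaning of the factor of 3 in the text
BLOCK_SIZE = 8
MULTS_PER_IDCT = 9_200    # lower bound quoted in the text
ADDS_PER_IDCT = 4_000     # lower bound quoted in the text
FRAMES_PER_SECOND = 30

# Number of 8-by-8 IDCTs needed to decode one image.
idcts_per_image = COLOR_COMPONENTS * (WIDTH * HEIGHT) // (BLOCK_SIZE * BLOCK_SIZE)

# Operations per second at the stated frame rate.
mults_per_second = idcts_per_image * MULTS_PER_IDCT * FRAMES_PER_SECOND
adds_per_second = idcts_per_image * ADDS_PER_IDCT * FRAMES_PER_SECOND

print(idcts_per_image)    # 49152
print(mults_per_second)   # 13565952000
print(adds_per_second)    # 5898240000
```

Since the per-IDCT counts are lower bounds, the resulting totals of roughly 13.6 billion multiplications and 5.9 billion additions per second are lower bounds as well.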
With the aim of providing the requisite processing power, several approaches have been considered. One such approach consists of performing a set of one-dimensional IDCTs on the individual rows of each successive received matrix of frequency-domain values, followed by a series of one-dimensional IDCTs on the resultant columns. Such a technique can reduce the number of multiplications and additions required to perform a complete IDCT. It is also known to use fused multiply-add instructions to further reduce the number of computations involved in performing the IDCT. In both of these cases, however, the decoding speed of the algorithms is directly dependent on the degree to which the main processor of the host system is occupied with other tasks, such as input/output (I/O) handling.
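The row-column approach described above may be sketched as follows and compared against a direct two-dimensional evaluation. This is an illustrative sketch only: it assumes orthonormal scaling and a precomputed basis table, and the names `BASIS`, `idct_1d`, `idct_2d_direct` and `idct_2d_row_column` are hypothetical. It does not use fused multiply-add instructions.

```python
import math

N = 8

# BASIS[k][i] = s(k) * cos(pi * (2i + 1) * k / (2N)), orthonormal scaling assumed.
BASIS = [
    [
        (math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
        * math.cos(math.pi * (2 * i + 1) * k / (2 * N))
        for i in range(N)
    ]
    for k in range(N)
]


def idct_1d(coeffs):
    """One-dimensional IDCT of N coefficients."""
    return [sum(BASIS[k][i] * coeffs[k] for k in range(N)) for i in range(N)]


def idct_2d_direct(block):
    """Direct evaluation: every output sample sums over all N*N coefficients."""
    return [
        [
            sum(
                BASIS[u][y] * BASIS[v][x] * block[u][v]
                for u in range(N)
                for v in range(N)
            )
            for x in range(N)
        ]
        for y in range(N)
    ]


def idct_2d_row_column(block):
    """Row-column evaluation: N row IDCTs followed by N column IDCTs."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d([rows[u][x] for u in range(N)]) for x in range(N)]
    # cols[x][y] holds the sample at row y, column x; transpose back.
    return [[cols[x][y] for x in range(N)] for y in range(N)]


# Hypothetical block of frequency-domain values.
block = [[float((u * N + v) % 7 - 3) for v in range(N)] for u in range(N)]
direct = idct_2d_direct(block)
separable = idct_2d_row_column(block)
max_error = max(
    abs(direct[y][x] - separable[y][x]) for y in range(N) for x in range(N)
)
```

In this sketch the two functions agree to within floating-point rounding. The row-column version performs 16 one-dimensional IDCTs of 64 multiplications each (1,024 in total), whereas the direct version performs 64×64 = 4,096 multiply-accumulate steps even if the basis products are tabulated in advance. These counts pertain to the sketch and are not the figures quoted above for a conventional IDCT.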
Other prior approaches have consisted of using a relatively simple software algorithm in conjunction with hardware support in the form of a dedicated IDCT co-processor. The use of a dedicated IDCT co-processor in the graphics processing unit (GPU) has the potential to offer a more scalable solution, since the decoding speed of the IDCT is no longer tied to the speed or availability of the main processor of the host system. However, the addition of a dedicated IDCT co-processor increases the cost and complexity of the hardware employed to effect video decoding.
Yet another way of providing hardware support for IDCT execution has consisted of providing specialized logic gates in the GPU. This approach affords marginal reductions in cost and complexity relative to the IDCT co-processor approach, while continuing to provide a decoding speed that is independent of the processing speed or availability of the main processor. However, the sheer amount of semiconductor real estate occupied by specialized logic gates capable of implementing a standard IDCT algorithm leaves little room for the implementation of other important functional blocks of the GPU.
Therefore, it would be advantageous to enable fast execution of the IDCT in a GPU without the need to provide additional hardware and without unduly monopolizing the resources of the main processor in the GPU.