1. Field of the Invention
This invention generally relates to video compression techniques and, more particularly, to a method for reducing the bit size required in the computation of video coding transformations.
2. Description of the Related Art
A video information format provides visual information suitable to activate a television screen, or store on a video tape. Generally, video data is organized in a hierarchical other. A video sequence is divided into group of frames, and each group can be composed of a series of single frames. Each frame is roughly equivalent to a still picture, with the still pictures being updated often enough to simulate a presentation of continuous motion. A frame is further divided into slices, or horizontal sections which helps system design of error resilience. Each slice is coded independently so that errors do not propagate across slices. A slice consists of macroblocks. In H.26P and Motion Picture Experts Group (MPEG)-X standards, a macroblock is made up of 16×16 luma pixels and a corresponding set of chrome pixels, depending on the video format. A macroblock always has an integer number of blocks, with the 8×8 pixel matrix being the smallest coding unit.
Video compression is a critical component for any application which requires transmission or storage of video data. Compression techniques compensate for motion by reusing stored information in different areas of the frame (temporal redundancy). Compression also occurs by transforming data in the spatial domain to the frequency domain. Hybrid digital video compression, exploiting temporal redundancy by motion compensation and spatial redundancy by transformation, such as Discrete Cosine Transform (DCT), has been adapted in H.26P and MPEG-X international standards as the basis.
As stated in U.S. Pat. No. 6,317,767 (Wang), DCT and inverse discrete cosine transform (IDCT) are widely used operations in the signal processing of image data. Both are used, for example, in the international standards for moving picture video compression put forth by the MPEG. DCT has certain properties that produce simplified and efficient coding models. When applied to a matrix of pixel data, the DCT is a method of decomposing a block of data into a weighted sum of spatial frequencies, or DCT coefficients. Conversely, the IDCT is used to transform a matrix of DCT coefficients back to pixel data.
Digital video (DV) codecs are one example of a device using a DCT-based data compression method. In the blocking stage, the image frame is divided into N by N blocks of pixel information including, for example, brightness and color data for each pixel. A common block size is eight pixels horizontally by eight pixels vertically. The pixel blocks are then “shuffled” so that several blocks from different portions of the image are grouped together. Shuffling enhances the uniformity of image quality.
Different fields are recorded at different time incidents. For each block of pixel data, a motion detector looks for the difference between two fields of a frame. The motion information is sent to the next processing stage. In the next stage, pixel information is transformed using a DCT. An 8-8 DCT, for example, takes eight inputs and returns eight outputs in both vertical and horizontal directions. The resulting DCT coefficients are then weighted by multiplying each block of DCT coefficients by weighting constants.
The weighted DCT coefficients are quantized in the next stage. Quantization rounds off each DCT coefficient within a certain range of values to be the same number. Quantizing tends to set the higher frequency components of the frequency matrix to zero, resulting in much less data to be stored. Since the human eye is most sensitive to lower frequencies, however, very little perceptible image quality is lost by this stage.
The quantization stage includes converting the two-dimensional matrix of quantized coefficients to a one-dimensional linear stream of data by reading the matrix values in a zigzag pattern and dividing the one-dimensional linear stream of quantized coefficients into segments, where each segment consists of a string of zero coefficients followed by a non-zero quantized coefficient. Variable length coding (VLC) then is performed by transforming each segment, consisting of the number of zero coefficients and the amplitude of the non-zero coefficient in the segment, into a variable length codeword. Finally, a framing process packs every 30 blocks of variable length coded quantized coefficients into five fixed-length synchronization blocks.
Decoding is essentially the reverse of the encoding process described above. The digital stream is first deframed. Variable length decoding (VLD) then unpacks the data so that it may be restored to the individual coefficients. After inverse quantizing the coefficients, inverse weighting and an inverse discrete cosine transform (IDCT) are applied to the result. The inverse weights are the multiplicative inverses of the weights that were applied in the encoding process. The output of the inverse weighting function is then processed by the IDCT.
Much work has been done studying means of reducing the complexity in the calculation of DCT and IDCT. Algorithms that compute two-dimensional IDCTs are called “type I” algorithms. Type I algorithms are easy to implement on a parallel machine, that is, a computer formed of a plurality of processors operating simultaneously in parallel. For example, when using N parallel processors to perform a matrix multiplication on N×N matrices, N column multiplies can be simultaneously performed. Additionally, a parallel machine can be designed so as to contain special hardware or software instructions for performing fast matrix transposition.
One disadvantage of type I algorithms is that more multiplications are needed. The computation sequence of type I algorithms involves two matrix multiplies separated by a matrix transposition which, if N=4, for example, requires 64 additions and 48 multiplications for a total number of 112 instructions. It is well known by those skilled in the art that multiplications are very time-consuming for processors to perform and that system performance is often optimized by reducing the number of multiplications performed.
A two-dimensional IDCT can also be obtained by converting the transpose of the input matrix into a one-dimensional vector using an L function. Next, the tensor product of constant a matrix is obtained. The tensor product is then multiplied by the one-dimensional vector L. The result is converted back into an N×N matrix using the M function. Assuming again that N=4, the total number of instructions used by this computational sequence is 92 instructions (68 additions and 24 multiplications). Algorithms that perform two-dimensional IDCTs using this computational sequence are called “type II” algorithms. In type II algorithms, the two constant matrices are grouped together and performed as one operation. The advantage of type II algorithms is that they typically require fewer instructions (92 versus 112) and, in particular, fewer costly multiplications (24 versus 48). Type II algorithms, however, are very difficult to implement efficiently on a parallel machine. Type II algorithms tend to reorder the data very frequently and reordering data on a parallel machine is very time-intensive.
There exist numerous type I and type II algorithms for implementing IDCTs, however, dequantization has been treated as an independent step depending upon DCT and IDCT calculations. Efforts to provide bit exact DCT and IDCT definitions have led to the development of efficient integer transforms. These integer transforms typically increase the dynamic range of the calculations. As a result, the implementation of these algorithms requires processing and storing data that consists of more than 16 bits.
It would be advantageous if intermediate stage quantized coefficients could be limited to a maximum size in transform processes.
It would be advantageous if a quantization process could be developed that was useful for 16-bit processors.
It would be advantageous if a decoder implementation, dequantization, and inverse transformation could be implemented efficiently with a 16-bit processor. Likewise, it would be advantageous if the multiplication could be performed with no more than 16 bits, and if memory access required no more than 16 bits.