1. Field of the Invention
The present invention relates to the field of computer-controlled multimedia audio/visual display. More specifically, the present invention relates to an efficient process for decoding audio/video material represented as a digital bit stream encoded using the Digital Video (DV) standard.
2. Related Art
Audio/visual (AV) material is increasingly stored, transmitted and rendered using digital data. Digital representation of AV material facilitates its use with computer-controlled electronics and also facilitates high-quality image and sound reproduction. Digital AV material is typically compressed (“encoded”) in order to reduce the computer resources required to store and transmit the digital data. Digital AV material can be encoded using a number of well-known standards including, for example, the DV (Digital Video) standard, the MPEG (Moving Picture Experts Group) standard and the JPEG standard. The encoding standards also specify the associated decoding processes.
The DV decoding process includes a sub-step called “inverse quantization,” which is also called “de-quantization.” Inverse quantization is a difficult part of the DV decoding process because the quantization matrix that is used in DV decoding is not a pre-loaded matrix, as it is in MPEG decoding. Therefore, the quantization matrix used in DV decoding needs to be computed for each new 8×8 pixel (or “data”) block.
For example, FIG. 1 illustrates a step in the inverse quantization process of a DV decoder. For 8×8-DCT (Discrete Cosine Transform) mode, an input 8×8 block of data 10 is multiplied by an 8×8 quantization matrix 20 to produce an 8×8 DCT matrix of coefficients 30. Each X coefficient (or “pixel”) of matrix 10 is multiplied by its associated Q coefficient of matrix 20 to produce a resultant coefficient in the 8×8 DCT matrix 30. The 8×8 DCT matrix 30 is the output of the inverse quantization of the input pixel block 10. However, each quantization coefficient (Qij) for each associated pixel (Xij) in the 8×8 matrix 10 is dynamically calculated based on certain parameters, thereby making this computation very difficult to implement in a SIMD (Single Instruction Multiple Data) architecture.
Traditional general-purpose processors perform inverse quantization in DV decoding using a very straightforward but time-consuming solution. For instance, in the prior art, the de-quantization coefficient (e.g., Qij) of each pixel element (e.g., Xij) is computed one-by-one, in a serial fashion, and then multiplied by its associated pixel value, and the result is stored in the DCT matrix 30. This is done serially for each of the 64 coefficients (X00-X77). That means, for each pixel (e.g., Xij) of the 8×8 block 10, at least one load instruction, one store instruction and one multiply (or shift) instruction are needed. This does not even include the time required to create the quantization coefficients (Qij) for each pixel (Xij), which are obtained from macroblock and block parameters. Therefore, using the conventional approach described above, it takes the general-purpose processor more than 200 instructions to completely process one 8×8 data block 10 through inverse quantization to create the DCT matrix 30.
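The serial approach described above can be illustrated with a short sketch. It is not taken from any prior-art implementation: the function name `dequantize_serial`, the operation counter and the sample values are assumptions made for illustration, and only the per-coefficient load, multiply and store pattern comes from the text.

```python
def dequantize_serial(x_block, q_matrix):
    """Multiply each coefficient Xij by Qij one at a time (serial baseline)."""
    dct = [[0] * 8 for _ in range(8)]
    ops = {"load": 0, "multiply": 0, "store": 0}
    for i in range(8):
        for j in range(8):
            x = x_block[i][j]        # one load per coefficient
            ops["load"] += 1
            y = x * q_matrix[i][j]   # one multiply (or shift) per coefficient
            ops["multiply"] += 1
            dct[i][j] = y            # one store per coefficient
            ops["store"] += 1
    return dct, ops


# Hypothetical sample data, chosen only to exercise the loop.
x_block = [[i * 8 + j for j in range(8)] for i in range(8)]
q_matrix = [[1 << ((i + j) // 4) for j in range(8)] for i in range(8)]

dct, ops = dequantize_serial(x_block, q_matrix)
total_ops = sum(ops.values())  # 3 operations x 64 coefficients
```

The counter reaches 192 operations for the 64 coefficients before any work is spent deriving each Qij, which is consistent with the total of more than 200 instructions stated above.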
Considering that DV decoding should be done in real-time to avoid image jitter and other forms of visual and/or audio artifacts with respect to the AV material, what is desired is a more efficient mechanism and method for performing inverse quantization to produce a DCT matrix 30 within a DV decoder.
Accordingly, the present invention provides a more efficient mechanism and method for performing inverse quantization within a DV decoder to produce a DCT matrix. The present invention performs up to eight multiply instructions in parallel for multiplying eight pixels (X) against eight quantization coefficients (Q) to simultaneously produce eight DCT coefficients using, in one embodiment, a 64-bit SIMD type media instruction set (and architecture) and a special quantization matrix. In another embodiment, a 128-bit SIMD type media instruction set (and architecture) can be used.
An efficient digital video (DV) decoder process is described herein that utilizes a specially constructed quantization matrix allowing an inverse quantization subprocess to perform parallel computations, e.g., using SIMD (Single Instruction Multiple Data) processing, to efficiently produce a matrix of DCT (Discrete Cosine Transform) coefficients. The present invention can take advantage of the SIMD architecture because it generates a vector containing the desired values which can then be processed in parallel. In the inverse quantization process of DV decoding, obtaining the quantization scale vectors is complex. One embodiment of the present invention utilizes 15 pre-defined quantization scales (a vector, also called herein an “array”) to dynamically build an 8×8 quantization matrix using one shift instruction for each row of the matrix after the first. Therefore, one load instruction and seven shift instructions are needed for obtaining an 8×8 quantization matrix for an 8×8 pixel block.
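The construction of the quantization matrix from the 15-valued array can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_quantization_matrix` and the sample scale values are hypothetical, and the sketch assumes that row i of the matrix is the eight-value window that starts at position i of the array.

```python
def build_quantization_matrix(scales):
    """Build an 8x8 matrix whose row i is the window scales[i:i+8]."""
    if len(scales) != 15:
        raise ValueError("expected 15 quantization scales")
    window = list(scales)        # one load of the 15-valued array
    shifts = 0
    rows = []
    for i in range(8):
        rows.append(window[:8])  # the first eight values form row i
        if i < 7:
            window = window[1:]  # shift by one value for the next row
            shifts += 1
    return rows, shifts


# Hypothetical scale values; real values depend on the decoded header.
scales = [1, 1, 1, 2, 2, 2, 4, 4, 4, 8, 8, 8, 16, 16, 16]
q_matrix, shifts = build_quantization_matrix(scales)
```

Under this assumption every entry satisfies Qij = scales[i + j], so 15 values (8 + 7) are enough to cover all 64 positions, and seven shifts follow the single load.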
The present invention utilizes a first look-up table (for 8×8 DCT mode) which produces a 15-valued array based on class number information, area number information and a quantization (QNO) number for an 8×8 data block (“data matrix” or “pixel block”) from the header information decoded from the encoded digital bitstream. The 8×8 data block is produced from a variable length decoding and inverse scan subprocess. An individual 8-valued segment of the 15-valued array is multiplied by an individual 8-valued segment, e.g., “a row,” of the 8×8 data matrix to produce an individual row of the 8×8 matrix of DCT coefficients (“DCT matrix”). The above eight multiplications can be performed in parallel using a SIMD architecture to simultaneously generate the row of eight DCT coefficients. In this way, eight passes through the 8×8 data block are used to produce the entire 8×8 DCT matrix; in one embodiment this consumes only 33 instructions per 8×8 data block. After each pass, the 15-valued array is shifted by one value to update its quantization coefficients for proper alignment with its associated row of the data block. This continues until all rows of the data block are processed. The DCT matrix is then processed by an inverse discrete cosine transformation subprocess that generates decoded display data. A second lookup table can be used for 2×4×8 DCT mode processing.
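The row-by-row process described above can be sketched as follows. The table `SCALE_TABLE_8X8`, its keys and its contents are made-up placeholders rather than values from the DV standard, and the helper `multiply_lanes` only stands in for a single SIMD multiply over eight lanes. The sketch also assumes that the area number information is reflected in the position of each value within the 15-valued array.

```python
# Hypothetical first look-up table for 8x8 DCT mode, keyed by
# (class number, QNO). Both entries are illustrative placeholders.
SCALE_TABLE_8X8 = {
    (0, 15): [1] * 15,
    (2, 5): [2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 8, 8, 8],
}


def multiply_lanes(a, b):
    """Stand-in for one SIMD multiply over eight 16-bit lanes."""
    return [x * y for x, y in zip(a, b)]


def dequantize_block(data_block, class_no, qno):
    """De-quantize an 8x8 data block one row per pass."""
    scales = list(SCALE_TABLE_8X8[(class_no, qno)])  # single table access
    dct = []
    for row in data_block:
        # Multiply the row by the first eight values of the array.
        dct.append(multiply_lanes(row, scales[:8]))
        # Shift the array by one value to align it with the next row.
        scales = scales[1:]
    return dct


# Hypothetical data block of ones, so the output equals the scales used.
data_block = [[1] * 8 for _ in range(8)]
dct_matrix = dequantize_block(data_block, class_no=2, qno=5)
```

With a data block of ones, each output row simply reproduces the eight-value segment of the array that was applied to it, which makes the sliding alignment easy to inspect.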
One embodiment of the present invention is applied to a software DV decoder on a microprocessor with 128-bit registers and a multimedia instruction set. This instruction set includes an instruction that multiplies eight 16-bit values from one register by eight 16-bit values from another register to simultaneously produce eight results, and an instruction that shifts two concatenated registers (256 bits together) by a specified number of bytes. By using these media instructions and the 128-bit register width, not only are the execution cycles reduced by the present invention, but the memory access latency for the quantization matrix is also reduced to that of a single access. In this implementation, 33 instructions are used to de-quantize one 8×8 block for both 8×8 DCT mode and 2×4×8 DCT mode.
In an alternate embodiment of the present invention, a 64-bit SIMD architecture can also be used. With 64-bit SIMD instructions, two multiplication instructions are applied to each row of the 8×8 matrix. Therefore, the cycles spent on multiplication are doubled compared to the 128-bit SIMD embodiment. However, the generation of the quantization matrix is analogous to the 128-bit SIMD embodiment.
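The difference between the two embodiments can be sketched as follows. The sketch assumes 16-bit coefficients, so that a 64-bit register holds four lanes; the names `multiply_4_lanes` and `dequantize_block_64bit` and the sample values are hypothetical.

```python
def multiply_4_lanes(a, b):
    """Stand-in for one 64-bit SIMD multiply over four 16-bit lanes."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("a 64-bit register holds four 16-bit lanes")
    return [x * y for x, y in zip(a, b)]


def dequantize_block_64bit(data_block, scales):
    """De-quantize an 8x8 block using two half-row multiplies per row."""
    window = list(scales)
    dct = []
    multiplies = 0
    for i, row in enumerate(data_block):
        low = multiply_4_lanes(row[:4], window[:4])      # columns 0..3
        high = multiply_4_lanes(row[4:8], window[4:8])   # columns 4..7
        multiplies += 2
        dct.append(low + high)
        if i < 7:
            window = window[1:]  # same one-value shift as the 128-bit case
    return dct, multiplies


# Hypothetical inputs chosen so that each output equals i + j + 1.
data_block = [[1] * 8 for _ in range(8)]
scales = list(range(1, 16))
dct_matrix, multiplies = dequantize_block_64bit(data_block, scales)
```

The sketch issues 16 multiplies per block instead of 8, while the array is still shifted by one value per row as in the 128-bit case.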
More specifically, embodiments of the present invention include, in a digital DV decoding process, a method of performing de-quantization comprising the steps of: a) obtaining a multi-valued array of quantization coefficients by referencing memory-stored information with class information and a quantization number that are associated with a block of data coefficients representing encoded information; b) multiplying data coefficients of a respective row of the block of data coefficients with quantization coefficients of a designated portion of the multi-valued array in parallel to simultaneously produce a respective row of coefficients within a discrete cosine transform (DCT) matrix; c) shifting the multi-valued array by one value to update quantization coefficients of the designated portion; and d) completing the DCT matrix by repeating steps b)-c) for all rows of the block of data coefficients. Embodiments include the above and wherein the multi-valued array comprises 15 quantization coefficients and wherein the respective row of the block comprises eight data coefficients and wherein the designated portion of the multi-valued array comprises eight quantization coefficients and wherein the step b) comprises the step of producing eight DCT coefficients in parallel by simultaneously multiplying said eight data coefficients by said eight quantization coefficients.