This application specifically relates to a method and apparatus for reducing the decode complexity of two-dimensional inverse transforms on a vector processor.
A typical digital video decoding system involves the following steps (among others).
For each block in a frame:
    A) Extract quantized transform coefficients from the compressed bit-stream
    B) Perform inverse quantization to reconstruct the transform coefficients
    C) Perform an inverse transform (typically an IDCT) on the coefficients
    D) Add the resultant values to a block predictor
    E) Output the block results
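As a rough illustration of steps B through E, the following Python sketch processes one block. It assumes step A has already produced the quantized coefficients. The function name `decode_block`, the single uniform step size `qstep`, the clamp to the 8-bit sample range, and the pluggable `inverse_transform` argument are all illustrative assumptions and are not taken from any particular codec.

```python
def decode_block(quantized, qstep, predictor, inverse_transform):
    """Sketch of steps B-E for one block (step A, bit-stream parsing, is assumed done).

    quantized         -- 2-D list of quantized transform coefficients
    qstep             -- hypothetical uniform quantizer step size
    predictor         -- 2-D list of predicted sample values, same shape
    inverse_transform -- callable mapping a coefficient block to a residual block
    """
    # B) Inverse quantization: scale every coefficient by the step size.
    coeffs = [[q * qstep for q in row] for row in quantized]

    # C) Inverse transform (an IDCT in most codecs; supplied by the caller here).
    residual = inverse_transform(coeffs)

    # D) Add the residual to the block predictor, clamping to 0..255 (assumed 8-bit video).
    # E) The returned block is the output for this position in the frame.
    return [
        [max(0, min(255, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(predictor, residual)
    ]
```

Passing an identity function as `inverse_transform` isolates the dequantize, add, and clamp steps, which can be convenient when checking the surrounding plumbing separately from the transform itself.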
The 2-dimensional inverse transform functions typically account for a large portion of the time needed to decode a frame, due to their complexity.
The invention described here attempts to reduce decoder complexity on vector processing machines, that is, machines capable of applying the same operation to multiple values stored sequentially in a register, by lowering the complexity of the 2-dimensional transform.
A 2-dimensional separable inverse transform performed on a block typically involves the following steps:
    a) For each row of the block:
           Perform the same 1-dimensional inverse transform on the transform coefficients.
    b) For each column of the block (resulting from (a)):
           Perform the same 1-dimensional inverse transform on the resulting values.
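The row pass followed by the column pass can be sketched as below. To keep the example short and exact in integer arithmetic, the 1-dimensional transform here is a 4-point Hadamard butterfly used as a stand-in for a real 1-D IDCT; the names `inverse_1d` and `inverse_2d` are hypothetical.

```python
def inverse_1d(v):
    """4-point Hadamard butterfly (a stand-in for a real 1-D inverse transform)."""
    a, b, c, d = v
    p, q = a + b, a - b      # first butterfly stage
    r, s = c + d, c - d
    return [p + r, q + s, p - r, q - s]   # second butterfly stage


def inverse_2d(block):
    """Separable 2-D inverse transform of a 4x4 block given as a list of rows."""
    # (a) Apply the 1-D transform to every row.
    after_rows = [inverse_1d(row) for row in block]

    # (b) Apply the same 1-D transform to every column of the result.
    #     zip(*m) walks the columns of a row-major matrix m.
    after_cols = [inverse_1d(list(col)) for col in zip(*after_rows)]

    # after_cols holds one transformed column per entry; transpose back to rows.
    return [list(row) for row in zip(*after_cols)]
```

Note that both passes call exactly the same 1-D routine; only the direction in which the data is walked differs. That property is what makes the transform a natural fit for processing several rows (or columns) at once.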
Since the 1-dimensional inverse transform usually involves performing exactly the same operations on a number of rows or columns in the block, vector processors are often used to reduce the decoding time. This is typically accomplished by filling vector processing registers with a value from each of N rows in the block (see diagram). The operations of the inverse transform are then performed on the N rows in parallel. The vector processing registers are then filled with values from each of the N columns in the block, and the inverse transform is performed on the N columns in parallel.
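The idea of processing N rows in parallel can be modelled in plain Python by treating each vector register as a list of N lanes. In the sketch below, register k holds the k-th coefficient of every row, and each `vadd`/`vsub` call stands for a single vector instruction. The 4-point Hadamard butterfly is again only a stand-in for the real 1-D transform, and all function names are hypothetical.

```python
def vadd(x, y):
    """Model of one vector add: the same operation applied to every lane."""
    return [a + b for a, b in zip(x, y)]


def vsub(x, y):
    """Model of one vector subtract."""
    return [a - b for a, b in zip(x, y)]


def inverse_1d_lanes(r0, r1, r2, r3):
    """Butterfly on four registers; lane i of each register belongs to row i."""
    p, q = vadd(r0, r1), vsub(r0, r1)
    r, s = vadd(r2, r3), vsub(r2, r3)
    return vadd(p, r), vadd(q, s), vsub(p, r), vsub(q, s)


def row_pass(block):
    """Transform all rows of a 4x4 block at once using the lane model."""
    # Fill the registers: register k takes the k-th value from each row.
    regs = [list(col) for col in zip(*block)]
    outs = inverse_1d_lanes(*regs)
    # outs[k][i] is output k of row i; regroup into rows for inspection.
    return [list(row) for row in zip(*outs)]
```

Eight vector operations transform all four rows, where a scalar implementation would need eight operations per row. The cost that remains is the register fill in the first line of `row_pass`: the coefficients of one row are adjacent in memory, but each register needs one value from every row.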
In order to fill the vector processing registers quickly with different values from each row, a programmer typically has two options:
    a) Transpose the coefficients so that the transform coefficients appear in the order that matches the vector processor and load them directly into the registers.
    b) Fill the vector registers one value at a time with the coefficient data.
Choice (a) requires numerous operations to perform the transpose and choice (b) requires numerous bit-mask AND/OR operations to place each coefficient into the register.
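The two choices can be compared with a small model in which a vector register is a single integer divided into fixed-width lanes. The 16-bit lane width and every function name below are assumptions made for illustration; real hardware would use dedicated shuffle or insert instructions rather than Python integers.

```python
LANE_BITS = 16                       # assumed lane width
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Model a direct vector load: consecutive values land in consecutive lanes."""
    reg = 0
    for lane, value in enumerate(values):
        reg |= (value & LANE_MASK) << (lane * LANE_BITS)
    return reg


def fill_by_transpose(block):
    """Choice (a): transpose first, then load each transposed row directly."""
    transposed = [list(col) for col in zip(*block)]
    return [pack(row) for row in transposed]


def insert_lane(reg, lane, value):
    """Place one value into one lane using a bit-mask AND followed by an OR."""
    shift = lane * LANE_BITS
    reg &= ~(LANE_MASK << shift)              # AND clears the target lane
    reg |= (value & LANE_MASK) << shift       # OR writes the new value
    return reg


def fill_by_insert(block):
    """Choice (b): build each register one coefficient at a time."""
    regs = []
    for k in range(len(block[0])):
        reg = 0
        for lane, row in enumerate(block):
            reg = insert_lane(reg, lane, row[k])
        regs.append(reg)
    return regs
```

Both functions produce identical register contents. They differ only in where the overhead falls: `fill_by_transpose` pays for the transpose up front, while `fill_by_insert` performs one mask-and-merge per coefficient.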