Throughout the disclosure, the term “block” of video data is used to denote a subset of the data comprising a frame of video data having spatial location within a rectangular region of the frame. A block of video data can but need not consist of compressed (or otherwise encoded) video data. Examples of blocks of video data are the conventionally defined macroblocks of MPEG-encoded video frames.
In many conventional applications, image data (e.g., video data) or other data undergo a two-dimensional (“2D”) transform and the transformed data is later inverse transformed to recover the original data. Examples of such transforms include 2D discrete cosine transforms (two-dimensional “DCTs”), 2D Hadamard transforms, and 2D Fourier transforms.
Throughout the disclosure, the expression “bypassing” an operation (that would otherwise generate an operation output value) denotes generating or asserting a substitute output value (in place of the operation output value) without actually performing the operation. For example, consider an operation that consists of asserting a zero value “z1” and a non-zero constant “c” to inputs of a multiplication circuit to cause the circuit to assert a current “cz1” at its output, asserting another zero value “z2” and a different non-zero constant “d” to inputs of a second multiplication circuit to cause that circuit to assert a current “dz2” at its output, and operating an addition circuit in response to the currents “cz1” and “dz2” to assert an output voltage “cz1+dz2” (equal to zero volts above ground potential) at a node. An example of “bypassing” this operation would be to ground the node (thereby forcing it to ground potential) without actually performing the multiplication and addition steps in the multiplication circuits and addition circuit.
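By way of illustration only, the notion of “bypassing” can be sketched in software. The sketch below is an analogy to the circuit example, not a description of any disclosed circuitry; the function names `multiply_add` and `multiply_add_with_bypass` are hypothetical and are introduced solely for this illustration.

```python
def multiply_add(c, z1, d, z2):
    """Normal path: two multiplications followed by one addition."""
    return c * z1 + d * z2


def multiply_add_with_bypass(c, z1, d, z2):
    """Bypass path (illustrative assumption): when both data inputs are
    zero, assert the substitute output value (zero) directly instead of
    performing the multiplications and the addition."""
    if z1 == 0 and z2 == 0:
        return 0  # substitute output value; no arithmetic is performed
    return multiply_add(c, z1, d, z2)
```

In this sketch the bypassed path yields the same value that the multiplications and addition would have produced, which is the property that makes the substitution safe.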
The present invention pertains to improved methods and systems for performing 2D transforms on 2D arrays of data values (i.e., arrays consisting of rows and columns of data values), where each of the values has a significant probability of being a zero value. In typical embodiments, the invention pertains to an improved method and system for performing an inverse transform of a 2D orthogonal transform (e.g., a 2D inverse discrete cosine transform or inverse Hadamard transform) on a 2D array of data values, where each of the values has a significant probability of being a zero value. In a class of preferred embodiments, the invention pertains to an improved method and system for performing a two-dimensional IDCT (2D inverse discrete cosine transform) on DCT coefficients. The DCT coefficients have been generated by performing a 2D discrete cosine transform on an array of video data (or other image data), and each has a significant probability of having the value zero.
Throughout this disclosure, the expression “zero value” (or “zero data value”) denotes data indicative of the value zero. Similarly, the expression “zero input data value” denotes input data indicative of the value zero. For example, a zero input value can be a word of input data (e.g., a DCT coefficient, or a color component or pixel of video data) having the value zero.
Throughout this disclosure, the expression “sparse” data (e.g., a sparse block of data to undergo an inverse transform) denotes data indicative of values that are likely to be zero values. For example, a block of input data (e.g., a block of DCT coefficients) indicative of relatively many zero values and relatively few non-zero values is a sparse block of data.
Inverse transform implementation is typically a major part of the implementation of any system that is to be compliant with a video compression and decompression standard. It is a computationally intensive process and contributes significantly to processing cycle and power consumption requirements. Mobile devices that implement video compression and decompression standards (e.g., portable media players) have especially stringent processing cycle and power consumption requirements: they need to meet the performance requirements set by the application and to consume very low power to maximize battery life, and the transform engine typically must be able to support multiple compression standards and the varying requirements that come with these standards.
Typical conventional implementations of 2D transforms (including 2D inverse transforms) on blocks of data use the following techniques in different combinations to improve performance or reduce power:
1. avoiding transformation of blocks that are identified by an external means as being uncoded blocks (where each input block provided to the transform engine is identified by the external means as being a coded or uncoded block). However, this technique has disadvantages, including that it can result in performance of unnecessary transform operations (e.g., transformation of blocks that are identified as coded blocks but consist only of zero DC coefficients);
2. identifying full rows or columns of each input data block that consist entirely of zero values (“zero-rows” or “zero-columns”) and bypassing normal transform operations that would otherwise be performed on each such row or column (e.g., by outputting predetermined values, typically “zero,” for each zero-row or zero-column). The zero-rows and zero-columns can either be specified by an external device or identified internally by the transform engine. However, this conventional technique does not improve performance or reduce power in many common situations in which a row (or column) is not a zero-row (or zero-column) but is a sparsely populated row (or column) including only a very small number of non-zero values;
3. identifying (from the input data) conditions that indicate that the same coefficients (previously determined for use in multiplying data values in an input data row or column) should be used for multiplying data values in a subsequent input data row or column, and avoiding the updating of such coefficients that would otherwise be performed to determine new coefficients for multiplying the data values in the subsequent input data row or column; and
4. implementing a distributed arithmetic transform (a lookup table-based implementation of a 2D transform). A typical lookup table-based implementation reduces overhead by reducing the number of multiplication operations that must be performed to transform a block. However, designing such an implementation is typically very complicated because very large ROM tables and also multi-ported ROM are typically required, and design constraints typically limit the improvement in power consumption that can be achieved.
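The second technique listed above (bypassing zero-rows) can be sketched as follows, purely for illustration. The sketch assumes an orthonormal 1D inverse DCT computed directly from its definition; the function names `idct_1d` and `row_pass_with_zero_row_bypass` are hypothetical, and a practical transform engine would use a fast algorithm rather than this direct form.

```python
import math


def idct_1d(coeffs):
    """Direct (unoptimized) 1D inverse DCT with orthonormal scaling."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for u, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(total)
    return out


def row_pass_with_zero_row_bypass(block):
    """Row pass of a 2D inverse transform (illustrative assumption).

    Rows consisting entirely of zero values are bypassed: zeros are
    output directly. Every other row is fully transformed, however
    few non-zero values it contains.
    Returns the transformed rows and the number of rows bypassed.
    """
    result = []
    bypassed = 0
    for row in block:
        if all(v == 0 for v in row):
            result.append([0.0] * len(row))  # predetermined output
            bypassed += 1
        else:
            result.append(idct_1d(row))
    return result, bypassed
```

In this sketch, a row such as `[8, 0, 0, 0, 0, 0, 0, 0]` is not a zero-row, so it receives the full transform even though seven of its eight values are zero. This is the limitation noted above for sparsely populated rows and columns.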
In another conventional 2D transform, described in US Patent Application Publication No. 2005/0033788 and related U.S. Pat. No. 6,799,192, the last non-zero entry in each column of a block of data is determined (when performing a column transform phase of an IDCT), and the transform system then branches to an appropriate one of eight different “specialized IDCT” program routines for implementing IDCT operations in software to inverse-transform each column. Apparently, simpler transform operations (requiring fewer multiplication and addition operations) could be employed to process a column having relatively many zeros (as indicated by having the last non-zero value in a higher position) and more complicated transform operations (requiring more multiplication and addition operations) could be employed to process a column having fewer zeros (as indicated by having the last non-zero value in a lower position). The references also teach that when performing a row transform phase of the IDCT (after the column transform phase), the last non-zero entry in each row of a block is determined and the transform system then branches to an appropriate one of eight different “specialized IDCT” program routines for implementing IDCT operations in software to inverse-transform each row.
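The selection step described above can be sketched as follows, for illustration only. The sketch does not reproduce the routines of the cited references. The function names are hypothetical, and the sketch assumes that the routine selected for a last non-zero entry at index k handles the first k+1 entries of the column or row. The treatment of an all-zero column (zero terms) is likewise an assumption.

```python
def last_nonzero_index(values):
    """Return the index of the last non-zero entry, or -1 if none exists."""
    for i in range(len(values) - 1, -1, -1):
        if values[i] != 0:
            return i
    return -1


def select_routine(values):
    """Return the number of leading entries that the selected routine
    would process (hypothetical stand-in for branching to one of the
    specialized routines)."""
    return last_nonzero_index(values) + 1
```

Under these assumptions, the column `[9, 0, 0, 0, 0, 0, 0, 0]` selects the simplest routine (one term), whereas the column `[0, 0, 0, 0, 0, 0, 0, 5]` selects the most complicated routine (eight terms) although it also contains only one non-zero value.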
There are a number of problems and limitations with the technique described in US Patent App. Publication No. 2005/0033788 and U.S. Pat. No. 6,799,192, including that the technique is inefficient in the sense that it does not improve performance or reduce power consumption when processing many columns and rows having typical patterns of zero and non-zero values. For example, when a column or row to be transformed includes zeros (especially, many zeros) but has a last entry that is non-zero, the technique would select a complicated (e.g., the most complicated) “specialized IDCT” routine that consumes much power to transform the column or row. In contrast, preferred embodiments of the present invention improve performance and reduce power consumption by avoiding transform operations on portions of rows and columns that consist of zero values (e.g., on each half-row or half-column, or each quarter-row or quarter-column, that consists of zero values) or performing such transform operations in a reduced-power manner. Some preferred embodiments of the present invention improve performance and reduce power consumption by avoiding transform operations on each individual zero value in a row or column to be transformed (or performing transform operations on each individual zero value in a row or column in a reduced-power manner).
There is no suggestion in US Patent App. Publication No. 2005/0033788 or U.S. Pat. No. 6,799,192 that the performance improvement and power consumption reduction benefits achievable by the technique disclosed therein can be increased by independently processing subsets of each row or column to be transformed, and no suggestion as to how to do so or as to whether it is possible to do so. In contrast, preferred embodiments of the present invention can sequentially perform the same operations on different subsets of each row or column to be transformed (e.g., inverse transformed), where the subsets of each row or column determine a partition of the row or column, and the performance improvement and power consumption reduction benefits achievable by such embodiments can be increased simply by decreasing the size of the subsets that determine each such partition. For example, some preferred embodiments of the present invention sequentially perform sets of operations on 2N-bit subsets of each 8N-bit row or column to be transformed (four sets of operations per row or column) to achieve excellent performance improvement and power consumption reduction benefits, and other preferred embodiments of the invention sequentially perform sets of operations on N-bit subsets of each 8N-bit row or column to be transformed (eight sets of operations per row or column) to achieve even better performance improvement and power consumption reduction benefits.
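The effect of the subset size can be sketched in software as follows. This is a minimal illustration under stated assumptions, not a description of the circuitry of any embodiment. It assumes an orthonormal 1D inverse DCT accumulated one coefficient at a time, and it models “avoiding transform operations” by skipping every subset that consists only of zero values. The names `basis` and `idct_1d_partitioned` are hypothetical.

```python
import math


def basis(u, x, n):
    """Orthonormal inverse-DCT basis value for frequency u at position x."""
    scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    return scale * math.cos((2 * x + 1) * u * math.pi / (2 * n))


def idct_1d_partitioned(coeffs, subset_size):
    """Inverse-transform one row or column, one subset at a time.

    The same operations are applied to each subset in sequence; a
    subset consisting only of zero values is skipped (illustrative
    assumption). Returns the output values and the number of
    coefficient terms that were actually accumulated.
    """
    n = len(coeffs)
    out = [0.0] * n
    terms = 0
    for start in range(0, n, subset_size):
        subset = coeffs[start:start + subset_size]
        if all(v == 0 for v in subset):
            continue  # zero subset: no transform operations performed
        for offset, c in enumerate(subset):
            u = start + offset
            terms += 1
            for x in range(n):
                out[x] += c * basis(u, x, n)
    return out, terms
```

For the row `[7, 0, 0, 0, 0, 0, 0, 3]`, this sketch accumulates eight terms when the whole row is treated as one subset, four terms with subsets of two values (four subsets per row), and two terms with subsets of one value (eight subsets per row), while producing the same output values in each case.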
Another conventional 2D transform is described in the paper by Rohini Krishnan, et al., entitled “Design of a 2D DCT/IDCT Application Specific VLIW Processor Supporting Scaled and Sub-sampled Blocks,” 16th International Conference on VLSI Design, six pages (2003). This paper teaches asserting a downscaled version of a full data block (e.g., an 8×4 block that has been generated by discarding even rows of an 8×8 block) to IDCT circuitry, and operating the IDCT circuitry to inverse-transform the downscaled block, including by bypassing some of the IDCT circuitry that could otherwise have been used to inverse-transform the full block. This method can avoid calculation of output values that will eventually be discarded, but does not detect and skip operations that will not contribute in any way to the final result.
Another conventional 2D transform is described in U.S. Pat. No. 5,883,823. This transform identifies regions of an input block to be transformed, and processes each region differently (e.g., an IDCT is performed on all elements of some regions and an IDCT is performed only on non-zero elements of other regions). For example, U.S. Pat. No. 5,883,823 apparently teaches (at col. 10, line 53-col. 11, line 26) an IDCT computation in which a “regional” IDCT calculation is performed on all elements (whether zero or non-zero) of one quadrant of an 8×8 block (i.e., the 4×4 quadrant corresponding to the lowest frequency ranges), and another IDCT calculation is performed only on non-zero elements of each of the other three 4×4 quadrants of the 8×8 block (i.e., the three 4×4 quadrants corresponding to higher frequency ranges). However, U.S. Pat. No. 5,883,823 does not teach or suggest how to identify non-zero elements of each region for which an IDCT calculation is to be performed only on non-zero elements (or how efficiently to identify such non-zero coefficients), or how to perform an IDCT calculation only on non-zero elements of a region of a block, or how efficiently (and in a manner consuming reduced power) to perform such an IDCT calculation only on such non-zero elements.