In the United States a standard has been proposed for digitally encoded high definition television signals. A portion of this standard is essentially the same as the MPEG-2 standard, proposed by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO). The standard is described in an International Standard (IS) publication entitled "Information Technology--Generic Coding of Moving Pictures and Associated Audio, Recommendation H.262", ISO/IEC 13818-2, IS, 11/94, which is available from the ISO and which is hereby incorporated by reference for its teaching on the MPEG-2 digital video coding standard.
The MPEG-2 standard is actually several different standards. In MPEG-2 several different profiles are defined, each corresponding to a different level of complexity of the encoded image. For each profile, different levels are defined, each level corresponding to a different image resolution. One of the MPEG-2 standards, known as Main Profile, Main Level is intended for coding video signals conforming to existing television standards (i.e., NTSC and PAL). Another standard, known as Main Profile, High Level is intended for coding high-definition television images. Images encoded according to the Main Profile, High Level standard may have as many as 1,152 active lines per image frame and 1,920 pixels per line.
The Main Profile, Main Level standard, on the other hand, defines a maximum picture size of 720 pixels per line and 567 lines per frame. At a frame rate of 30 frames per second, signals encoded according to this standard have a data rate of 720 * 567 * 30 or 12,247,200 pixels per second. By contrast, images encoded according to the Main Profile, High Level standard have a maximum data rate of 1,152 * 1,920 * 30 or 66,355,200 pixels per second. This data rate is more than five times the data rate of image data encoded according to the Main Profile, Main Level standard. The standard proposed for HDTV encoding in the United States is a subset of the Main Profile, High Level standard, having as many as 1,080 lines per frame, 1,920 pixels per line and a maximum frame rate, for this frame size, of 30 frames per second. The maximum data rate for this proposed standard is still far greater than the maximum data rate for the Main Profile, Main Level standard.
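The pixel-rate comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name pixel_rate and the variable names are illustrative only and do not appear in the standard.

```python
def pixel_rate(pixels_per_line, lines_per_frame, frames_per_second):
    """Return the pixel rate in pixels per second."""
    return pixels_per_line * lines_per_frame * frames_per_second

# Figures taken from the text above.
mp_ml = pixel_rate(720, 567, 30)      # Main Profile, Main Level
mp_hl = pixel_rate(1920, 1152, 30)    # Main Profile, High Level
us_hdtv = pixel_rate(1920, 1080, 30)  # proposed U.S. HDTV subset

print(mp_ml, mp_hl, us_hdtv)
print(round(mp_hl / mp_ml, 2), round(us_hdtv / mp_ml, 2))
```

Under these figures the proposed HDTV subset works out to 62,208,000 pixels per second, which is still roughly five times the Main Profile, Main Level rate.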
The MPEG-2 standard defines a complex syntax which contains a mixture of data and control information. Some of this control information is used to enable signals having several different formats to be covered by the standard. These formats define images having differing numbers of picture elements (pixels) per line, differing numbers of lines per frame or field and differing numbers of frames or fields per second. In addition, the basic syntax of the MPEG-2 Main Profile defines the compressed MPEG-2 bit stream representing a sequence of images in six layers: the sequence layer, the group of pictures layer, the picture layer, the slice layer, the macroblock layer, and the block layer. Each of these layers is introduced with control information. Finally, other control information, also known as side information (e.g., frame type, macroblock pattern, image motion vectors, coefficient zig-zag patterns and dequantization information), is interspersed throughout the coded bit stream.
To effectively receive the digital images, a decoder must process the video signal information rapidly. To be optimally effective, the decoding systems should be relatively inexpensive and yet have sufficient power to decode these digital signals in real time.
Using existing techniques, a decoder may be implemented using a single processor having a complex design and operating at a high data rate to perform this function. This high data rate, however, would require very expensive circuitry, which would be contrary to the implementation of a decoder in a consumer television receiver in which cost is a major factor.
Another alternative is to use a decoder employing parallel processing. Using parallel processing reduces the cost of the circuitry while maintaining the high data rates. FIG. 13 shows one such system. The decoder in FIG. 13 includes two parallel processing paths A and B. First, the input bit-stream is applied to router circuitry 5. Router circuitry 5 directs the bit-stream into different logically defined processing paths A and B, each path processing macroblocks from a respectively different slice of an MPEG-2 encoded image. Variable Length Decoders (VLD) 10a and 10b decode the separated data streams to generate blocks of quantized discrete cosine transform (DCT) coefficient values. These blocks of values are applied to respective inverse zig-zag scan memories 15a and 15b to perform the inverse scan. The inverse quantizers 20a and 20b perform an inverse quantization of the quantized DCT values provided by inverse zig-zag scan memories 15a and 15b. The DCT coefficient values are provided to inverse discrete cosine transform (IDCT) circuits 25a and 25b. The output data of IDCT circuits 25a and 25b are blocks of pixel values or differential pixel values.
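The inverse scan and inverse quantization stages of one processing path can be sketched in software as follows. This is only an illustration under stated assumptions: it uses the conventional 8x8 zig-zag order, and the uniform step in inverse_quantize is a hypothetical simplification, since actual MPEG-2 inverse quantization also involves weighting matrices and a quantizer scale. The function names are illustrative and do not correspond to elements of FIG. 13.

```python
def zigzag_order(n=8):
    """Return (row, col) positions in conventional zig-zag scan order."""
    def key(pos):
        r, c = pos
        d = r + c  # index of the anti-diagonal
        # Odd diagonals are walked top to bottom, even ones bottom to top.
        return (d, r if d % 2 else c)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def inverse_zigzag(coeffs, n=8):
    """Place a scanned list of n*n coefficients back into an n x n block."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(coeffs, zigzag_order(n)):
        block[r][c] = value
    return block

def inverse_quantize(block, step):
    """Hypothetical uniform dequantizer: scale every coefficient by step."""
    return [[v * step for v in row] for row in block]

scanned = list(range(64))            # stand-in for VLD output
block = inverse_quantize(inverse_zigzag(scanned), step=2)
flat_order = [r * 8 + c for r, c in zigzag_order()]
```

In this sketch the first ten raster positions visited by the scan are 0, 1, 8, 16, 9, 2, 3, 10, 17 and 24.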
Each of the IDCT circuits 25a and 25b performs a 2-dimensional IDCT operation on the DCT coefficient values. An Inverse Discrete Cosine Transformation (IDCT) is performed, as discussed above, to reconstruct the original picture elements or pixels. An 8-point 1-D IDCT is shown in equation (1): ##EQU1## where xn (n=0, 1, 2, . . . ,7) is the result of the matrix multiplication, Xn is an input coefficient value, and a, b, c, d, e, f, g are constants in the IDCT matrix. Intermediate coefficient values are produced using equations (2): ##EQU2## Each IDCT circuit implements the 1-D IDCT of equations (1) and (2) twice. The values Xn provided to the first 1-D IDCT are DCT coefficient values and the output values xn produced are intermediate coefficient values. The input values Xn to the second 1-D IDCT are the transposed intermediate coefficient values xn from the first 1-D IDCT. The output values xn of the second 1-D IDCT are pixel values. Equation (1) includes matrix multiplication to calculate an inner product.
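The two-pass structure described above, in which a 1-D IDCT is applied, the intermediate values are transposed, and the 1-D IDCT is applied again, can be sketched as follows. This sketch assumes the textbook orthonormal 8-point IDCT computed by direct matrix multiplication; it does not reproduce the particular matrix of equation (1) or its constants a through g, and the function names are illustrative.

```python
import math

N = 8

def idct_1d(X):
    """Orthonormal N-point inverse DCT by direct inner products."""
    out = []
    for n in range(N):
        acc = 0.0
        for k in range(N):
            ck = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            acc += ck * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(acc)
    return out

def transpose(block):
    return [list(row) for row in zip(*block)]

def idct_2d(coeffs):
    """2-D IDCT as two 1-D passes separated by a transposition."""
    intermediate = [idct_1d(row) for row in coeffs]            # first 1-D IDCT
    pixels_t = [idct_1d(row) for row in transpose(intermediate)]  # second 1-D IDCT
    return transpose(pixels_t)

# A block with only a DC coefficient reconstructs to a flat block.
dc_block = [[0.0] * N for _ in range(N)]
dc_block[0][0] = 800.0
pixels = idct_2d(dc_block)
```

With the orthonormal scaling assumed here, a DC coefficient of 800 yields a flat block in which every pixel value is 100.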
One method of calculating the inner product is distributed arithmetic. Distributed arithmetic is a bit-serial computational operation that forms an inner product of a pair of vectors. Distributed arithmetic has been used in the past to perform DCTs and IDCTs as shown in Maruyama, VLSI Architecture and Implementation of a Multi-Function Forward/Inverse Discrete Cosine Transform Processor, Visual Communications and Image Processing '90, Vol. 1360, pp. 410-417, and TWO-DIMENSIONAL DISCRETE COSINE TRANSFORM PROCESSOR, U.S. Pat. No. 4,791,598, (hereinafter the '598 patent) issued to Liou et al., each incorporated herein by reference for their teachings on distributed arithmetic to perform DCTs and IDCTs.
Distributed arithmetic is a bit-serial method where individual bits of the input values are used to address a Look-up Table (LUT) stored in, for example, a Read Only Memory (ROM). In general, this can be extended to a digit-serial method using Z bits per input value. The number Z is often referred to as the number of bits-at-a-time (baat). The LUT must be large enough to accommodate an input vector of length N with Z bits per input. One LUT could be used having an address of N * Z bits; however, this leads to a large LUT. The preferred embodiment of the present invention described herein uses Z LUTs, each having an address of N-1 bits. The address reduction from N to N-1 exploits the fact that the absolute value of data is mirrored from the top half to the bottom half of a LUT having N address bits when offset binary is used to generate the LUT. The precomputed values in the LUT are inner products of the constant IDCT matrix in equation (1) and a single bit from each of N input values. These pre-computed values are then summed in a digit-serial manner to produce the complete inner product values.
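The basic bit-serial form of distributed arithmetic can be illustrated with a short sketch. It assumes one bit-at-a-time (Z = 1), two's-complement inputs, and a full LUT with N address bits; it does not implement the offset-binary reduction to N-1 address bits described above. The constants and input values are arbitrary illustrative integers, not IDCT matrix entries.

```python
def build_lut(consts):
    """LUT entry at address a is the sum of the constants whose bit is set in a."""
    n = len(consts)
    return [sum(c for i, c in enumerate(consts) if (addr >> i) & 1)
            for addr in range(1 << n)]

def da_inner_product(lut, inputs, width):
    """Inner product of the LUT's constants with width-bit two's-complement inputs."""
    mask = (1 << width) - 1
    words = [x & mask for x in inputs]  # two's-complement bit patterns
    acc = 0
    for b in range(width):
        # Gather bit b of every input word to form the LUT address.
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> b) & 1) << i
        partial = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -partial if b == width - 1 else partial
    return acc

consts = [23, -17, 5, 11, -3, 7, -29, 13]         # illustrative constants
inputs = [100, -2048, 2047, -1, 0, 513, -700, 42]  # 12-bit signed inputs
lut = build_lut(consts)
result = da_inner_product(lut, inputs, width=12)
expected = sum(c * x for c, x in zip(consts, inputs))
```

Each loop iteration corresponds to one clock period of a bit-serial implementation, so a 12-bit word takes twelve LUT accesses at one bit-at-a-time; processing Z bits per clock with Z LUTs would reduce this proportionally.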
In the decoder of FIG. 13, the IDCT circuits 25a and 25b can be implemented using IDCT processors that employ distributed arithmetic techniques. The disadvantage of this approach is that the pipeline is not kept full in the first 1-D IDCT section of both 25a and 25b. Eight 12-bit parallel input words are required to perform the inner product in FIG. 13. In distributed arithmetic, the word width in bits is divided by the number of input words to get the ideal number of bits-at-a-time (baat). In this case twelve divided by eight is 1.5. This number must be rounded up to 2 for actual implementation. The number of clock periods required for the distributed arithmetic is calculated by dividing the word width in bits by the number of baat. The result of 12 divided by 2 is 6 clock periods. The eight input words require eight clock periods to arrive. Therefore, the pipeline is idle for 2 clock periods. In other words, resources in the first 1-D IDCT section of both 25a and 25b are not used for 2 out of every 8 clocks. Although the processing speed of the decoder is maintained using the dual processing paths, the cost of the decoder is increased by the duplicate IDCT circuits.
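The pipeline utilization arithmetic above can be restated as a small calculation. This is a sketch of the reasoning in the preceding paragraph only; the function name da_schedule is hypothetical, and the sketch assumes one input word arrives per clock period.

```python
import math

def da_schedule(word_width, num_inputs):
    """Return (ideal baat, implemented baat, DA clock periods, idle clock periods)."""
    ideal_baat = word_width / num_inputs        # 12 / 8 = 1.5
    baat = math.ceil(ideal_baat)                # rounded up to a whole number of bits
    da_clocks = math.ceil(word_width / baat)    # clocks to consume one word
    idle_clocks = num_inputs - da_clocks        # one input word arrives per clock
    return ideal_baat, baat, da_clocks, idle_clocks

ideal_baat, baat, da_clocks, idle_clocks = da_schedule(word_width=12, num_inputs=8)
print(ideal_baat, baat, da_clocks, idle_clocks)
```

For eight 12-bit words this gives 2 bits-at-a-time, 6 clock periods of computation, and 2 idle clock periods out of every 8.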
The '598 patent illustrates an alternative method in a two-dimensional DCT processor which transforms pixels to DCT coefficients using distributed arithmetic. The '598 patent uses distributed arithmetic to simultaneously compute the inner product of an entire row or column of a matrix. The DCT processor includes an Nx1 column DCT processor which includes N circuits that compute the elements of the column transformation concurrently. The elements of the column transformation are stored in a transposition memory. Then, after being transposed, the output of the transposition memory is transformed by an Nx1 row processor. The processor of the '598 patent is not provided with pixel data from parallel paths and, thus, does not produce DCT coefficients from parallel processing paths. Instead, it separates the pixels from a single processing path to transform that data in parallel. Consequently, the pixel data is processed at the rate it is received.