1. Field of the Invention
The present invention relates generally to video coding, and in particular, to pre-quantization of motion compensated blocks for video coding at very low bit rates. The present invention provides a method and an apparatus for significantly reducing the number of computations at a video encoder.
2. Background Art
FIG. 1 illustrates the general structural blocks that are used for, and the steps involved in, the conventional digital coding of a sequence of video images. In particular, the video image is made up of a sequence of video frames 10 that are captured, such as by a digital camera, and transmitted to a video encoder 12. The video encoder 12 receives the digital data on a frame-by-frame and macroblock-by-macroblock basis, and applies a video encoding algorithm to compress the video data. In some applications, the video encoding algorithm can also be implemented in hardware. The video encoder 12 generates an output which consists of a binary bit stream 14 that is processed by a modulator 16. The modulator 16 modulates the binary bit stream 14 and provides the appropriate error protection. The modulated binary bit stream 14 is then transmitted over an appropriate transmission channel 18, such as through a wireless connection (e.g., radio frequency), a wired connection, or via the Internet. The transmission can be done in an analog format (e.g., over phone lines or via satellite) or in a digital format (e.g., via ISDN or cable). The transmitted binary bit stream 14 is then demodulated by a demodulator 20 and provided to a video decoder 22. The video decoder 22 takes the demodulated binary bit stream 24 and converts or decodes it into sequential video frames. These video frames are then provided to a display 26, such as a television screen or monitor, where they can be viewed. If the transmission channel 18 utilizes an analog format, a digital-to-analog converter is provided at the modulator 16 to convert the digital video data to analog form for transmission, and an analog-to-digital converter is provided at the demodulator to convert the analog signals back into digital form for decoding and display.
The video encoding can be embodied in a variety of ways. For example, the actual scene or image can be captured by a camera and provided to a chipset for video encoding. This chipset could take the form of an add-on card that is added to a personal computer (PC). As another example, the camera can include an on-board chip that performs the video encoding. This on-board chip could take the form of an add-on card that is added to a PC, or of a separate stand-alone video phone. As yet another example, the camera could be provided on a PC and the images provided directly to the processor on the PC which performs the video encoding.
Similarly, the video decoder 22 can be embodied in the form of a chip that is incorporated either into a PC or into a video box that is connected to a display unit, such as a monitor or television set.
Each digital video frame 10 is made up of x columns and y rows of pixels (also known as "pels"). In a typical frame 10 (see FIG. 2), there could be 720 columns and 640 rows of pels. Since each pel contains 8 bits of data (for luminance data), each frame 10 could have over three million bits of data (for luminance data). If we include chrominance data, each pel has up to 24 bits of data, so that this number is even greater. This large quantity of data is unsuitable for data storage or transmission because most applications have limited storage (i.e., memory) or limited channel bandwidth. To respond to the large quantity of data that has to be stored or transmitted, techniques have been provided for compressing the data from one frame 10 or a sequence of frames 10 to provide an output that contains a minimal amount of data. This process of compressing large amounts of data from successive video frames is called video compression, and is performed in the video encoder 12.
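The arithmetic behind these figures can be checked with a short sketch. The function name raw_frame_bits is hypothetical and is used here for illustration only; the frame dimensions and bit depths are the ones given above.

```python
def raw_frame_bits(columns: int, rows: int, bits_per_pel: int) -> int:
    """Number of bits in one uncompressed frame."""
    return columns * rows * bits_per_pel


# 720 columns by 640 rows, as in the typical frame described above.
luma_bits = raw_frame_bits(720, 640, 8)    # luminance data only
color_bits = raw_frame_bits(720, 640, 24)  # luminance plus chrominance data

print(luma_bits)   # 3686400 bits, i.e. over three million
print(color_bits)  # 11059200 bits
```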
During conventional video encoding, the video encoder 12 will take each frame 10 and divide it into blocks. In particular, each frame 10 can be first divided into macroblocks MB, as shown in FIG. 2. Each of these macroblocks MB can have, for example, 16 rows and 16 columns of pels. Each macroblock MB can be further divided into four blocks B, each block having 8 rows and 8 columns of pels. Once each frame 10 has been divided into blocks B, the video encoder 12 is ready to compress the data in the frame 10.
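The division of a frame into macroblocks MB and blocks B might be sketched as follows. This is a minimal illustration, assuming the frame is held as a list of rows of pel values and that both frame dimensions are multiples of the tile size; the helper name split_into_tiles and the small 32 by 32 test frame are assumptions made for the example.

```python
def split_into_tiles(frame, size):
    """Split a 2-D list of pels into size x size tiles, in raster order.

    Assumes both dimensions of the frame are multiples of size.
    """
    rows, cols = len(frame), len(frame[0])
    tiles = []
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            tiles.append([row[left:left + size] for row in frame[top:top + size]])
    return tiles


# A small synthetic 32 x 32 frame (hypothetical data) keeps the example short.
frame = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]

macroblocks = split_into_tiles(frame, 16)                   # 16 x 16 macroblocks MB
blocks = [split_into_tiles(mb, 8) for mb in macroblocks]    # four 8 x 8 blocks B per MB

print(len(macroblocks), len(blocks[0]))
```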
FIG. 3 illustrates the different steps, and the possible hardware components, that are used by the conventional video encoder 12 to carry out the video compression. Each frame 10 is provided to a motion estimation engine 30 which performs motion estimation. Since each frame 10 contains a plurality of blocks B, the following steps will process each frame 10 on a block-by-block basis.
Motion estimation calculates the displacement of one frame in a sequence with respect to the previous frame. By calculating the displacement on a block basis, a displaced frame difference can be computed which is easier to code, thereby reducing temporal redundancies. For example, since the background of a picture or image usually does not change, the entire frame does not need to be encoded, and only the moving objects within that frame (i.e., representing the differences between sequential frames) need to be encoded. Motion estimation will predict how much the moving object will move in the next frame based on certain motion vectors, and will then take the object and move it from a previously reconstructed frame to form a predicted frame. At the video decoder 22, the previously reconstructed frame, together with the motion vectors used for that frame, will reproduce the predicted frame (also known as "motion compensation"). At the video encoder 12, the predicted frame is subtracted from the current frame to obtain an "error" frame. This "error" frame will contain zeros at the pels where the background did not change between the previously reconstructed frame and the current frame. Since the background makes up a large part of the picture or image, the "error" frame will typically contain many zeros.
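Block-based motion estimation and the resulting "error" block can be illustrated by the following sketch. The text above does not specify a search strategy or a matching criterion, so the exhaustive search and the sum of absolute differences (SAD) used here are assumptions, as are all function names and the synthetic frames. The residual is taken as the current block minus the displaced block of the previously reconstructed frame, which is the usual convention.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block of cur and a displaced block of ref."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
    return total


def best_motion_vector(cur, ref, top, left, size, search):
    """Exhaustive search for the displacement (dy, dx) into ref with the lowest SAD."""
    rows, cols = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, top, left, 0, 0, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= top + dy and top + dy + size <= rows
                      and 0 <= left + dx and left + dx + size <= cols)
            if not inside:
                continue  # candidate block would fall outside the reference frame
            cost = sad(cur, ref, top, left, dy, dx, size)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost


def residual_block(cur, ref, top, left, mv, size):
    """'Error' block: current block minus the motion-compensated prediction."""
    dy, dx = mv
    return [[cur[top + r][left + c] - ref[top + dy + r][left + dx + c]
             for c in range(size)] for r in range(size)]


def make_frame(obj_top, obj_left):
    """Hypothetical 16 x 16 frame: flat background with a bright 4 x 4 object."""
    frame = [[50] * 16 for _ in range(16)]
    for r in range(4):
        for c in range(4):
            frame[obj_top + r][obj_left + c] = 200
    return frame


previous = make_frame(4, 4)   # previously reconstructed frame
current = make_frame(6, 5)    # object has moved 2 pels down and 1 pel right

mv, cost = best_motion_vector(current, previous, top=4, left=4, size=8, search=3)
residual = residual_block(current, previous, 4, 4, mv, 8)
print(mv, cost)  # the vector points back to where the block came from
```

In this example the match is exact, so the residual block contains only zeros, which is the property that the later coding steps exploit.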
Each frame 10 can be either an "intraframe" (also known as "I" frame) or an "interframe" (also known as "P" frame). Each I frame is coded independently, while each P frame depends on previous frames. In other words, a P frame uses temporal data from previous P frames to remove temporal redundancies. An example of a temporal redundancy can be the background of an image that does not move from one frame to another, as described above. For example, the "error" frame described above would be a P frame. In addition to I and P frames, there also exists another type of frame, known as a "B" frame, which uses both previous and future frames for prediction purposes.
Now, referring back to FIG. 3, all digital frames 10 received from the motion estimation engine 30 are provided to a frame-type decision engine 40, which operates to divide all the incoming frames 10 into I frames, P frames and B frames. Whether a frame 10 becomes an I, P or B frame is determined by the amount of motion experienced by that frame 10, the degradation of distortion, type of channel decisions, and desired user parameters, among other factors. From this point onward, all I, P and B frames are processed in the same manner.
Each block B from each frame 10 is now provided to a QP decision engine 50 which determines a QP or quantization step size number for the block or groups of blocks. This QP number is determined by a rate control mechanism which divides a fixed bit budget of a frame among different blocks, and is used by the quantization engine 80 to carry out quantization as described below.
Each block B is now provided to a DCT engine 60. DCT of individual blocks helps in removing the spatial redundancy by concentrating the most relevant information into the lowest-order coefficients in the DCT domain. DCT can be accomplished by carrying out a Fourier-like transformation of the values in each block B. DCT produces a transformed block 70 in which the lowest frequency values are placed in the top left corner 72 of the transformed block 70, and the higher frequency values are placed in the bottom right corner 74.
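The energy-compacting behavior of the DCT can be seen in a direct, unoptimized sketch of the orthonormal two-dimensional DCT (type II) for an 8 by 8 block. This particular normalization is an assumption, since the text does not state which variant is used, and a practical encoder would use a fast algorithm rather than this quadruple loop. The function name dct_2d and the two test blocks are hypothetical.

```python
import math

N = 8  # block size used throughout the description


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized form)."""
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = (2.0 / N) * scale(u) * scale(v) * total
    return coeffs


# A perfectly flat block: all energy lands in the top left coefficient.
flat_coeffs = dct_2d([[100] * N for _ in range(N)])

# A horizontal ramp: only the first row of coefficients is non-zero.
ramp_coeffs = dct_2d([[10 * c for c in range(N)] for _ in range(N)])

print(round(flat_coeffs[0][0], 3))
```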
After having obtained a block 70 of DCT coefficients which contain the energy of the displaced blocks, quantization of these blocks 70 is performed by quantization engine 80. Quantization is a uniform quantization with a step size (i.e., the predetermined QP) varying within a certain range, such as from 2 to 62. It is implemented as a division, or as a table look-up operation for a fixed-point implementation, of each value in the transformed block 70. For example, the quantization level for each value in the block 70 can be determined by dividing the value by 2QP. Therefore, if QP is 10 and a value in the block is 100, then the quantization level for this value is equal to 100 divided by 2QP, or 5. At the video decoder 22 in FIG. 1, the value is reconstructed by multiplying the quantization level (i.e., 5) by 2QP to obtain the original value of 100. Thus, quantization maps the large set of possible coefficient values onto a much smaller set of quantization levels, providing a quantized block 90 where the top left corner 92 contains higher quantized levels, and the bottom right corner 94 contains mostly zeros.
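The division by 2QP and the corresponding reconstruction can be sketched as follows. This follows the simplified description given above; the treatment of negative values (truncation toward zero) is an assumption, and the function names are hypothetical.

```python
def quantize(value: int, qp: int) -> int:
    """Quantization level: value divided by 2*QP, truncated toward zero (assumed)."""
    step = 2 * qp
    magnitude = abs(value) // step
    return magnitude if value >= 0 else -magnitude


def dequantize(level: int, qp: int) -> int:
    """Decoder-side reconstruction: quantization level multiplied by 2*QP."""
    return level * 2 * qp


qp = 10
level = quantize(100, qp)               # 100 / 20 = 5, as in the example above
reconstructed = dequantize(level, qp)   # 5 * 20 = 100

# Values that are not multiples of 2*QP are only approximated,
# and small values collapse to zero.
approx = dequantize(quantize(107, qp), qp)
small = quantize(7, qp)
print(level, reconstructed, approx, small)
```

The last two results show why the step is lossy and why the high frequency corner of the quantized block 90 fills with zeros: any value smaller in magnitude than 2QP maps to level zero.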
Next, the quantized block 90 is provided to a zig-zag scan engine 100 which performs a zig-zag scan of the values in the block 90. The direction of the scan is illustrated in FIG. 4, and begins from the top left corner 92, which contains the higher quantized levels, through the middle of the block 90 and to the bottom right corner 94, which contains mostly zeros. The zig-zag scan produces a zig-zag scan block 110 in which the quantized values from the quantized block 90 are positioned linearly across the zig-zag scan block 110. Therefore, zig-zag scan emulates going from a lower to a higher frequency, thereby resulting in long runs of zeros in the zig-zag scan block 110. The values in the zig-zag scan block 110 are then provided to a variable length coding engine 120 where entropy coding is performed. Traditionally, most video coding standards use Huffman coding for entropy coding. First, a non-zero value followed by runs of zeros is encoded as a single "event". For example, "400000000000" and "10000000000000" would each be encoded as separate single events. Entropy coding is then performed on these events to generate a unique binary code for each event. These binary codes are output as the binary bitstream 14 described above. These unique binary codes can be recognized by the video decoder 22 and decoded by the video decoder 22 into the original values (i.e., non-zero values followed by runs of zeros).
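The zig-zag scan and the grouping of values into events might be sketched as follows. The scan order shown is the conventional zig-zag order, which is assumed here to correspond to FIG. 4. Each event is represented as a pair (value, number of zeros that follow it), and the first scanned value always opens an event even if it is zero; both choices are assumptions made for the example, as are the function names. The Huffman coding of the events is not shown.

```python
def zigzag_order(n=8):
    """Block positions (row, col) in conventional zig-zag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Positions are grouped by anti-diagonal (r + c); the direction of
        # travel alternates from one diagonal to the next.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )


def zigzag_scan(block):
    """Read a square block into a linear list, low frequencies first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def to_events(scanned):
    """Group the scan into (value, run_of_following_zeros) events."""
    events = []
    for value in scanned:
        if value != 0 or not events:
            events.append([value, 0])   # start a new event
        else:
            events[-1][1] += 1          # extend the run of zeros
    return [tuple(e) for e in events]


def from_events(events):
    """Decoder side: expand the events back into the linear scan."""
    out = []
    for value, run in events:
        out.append(value)
        out.extend([0] * run)
    return out


# Hypothetical quantized block: two non-zero levels near the top left corner.
quantized = [[0] * 8 for _ in range(8)]
quantized[0][0] = 4
quantized[1][0] = 1

scanned = zigzag_scan(quantized)
events = to_events(scanned)
print(events)
```

Here 64 quantized values reduce to two events, which is the reduction that the entropy coder then exploits.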
Thus, the conventional video encoder 12 and its operation, as illustrated in FIG. 3, function to minimize (i.e., compress) the large number of bits at the input blocks B of each frame 10 (see FIG. 2) to a minimal number of bits at the bitstream 14, taking advantage of the fact that the DCT and quantization steps will produce multiple runs of zeros. The transmitted bitstream 14 is decoded by the video decoder 22 by reversing the steps performed by the video encoder 12.
The values in each frame 10 can represent different meanings. For example, in an I frame, each value can range from zero to 255, with zero representing the darkest (or black) pel, and 255 representing the brightest pel. In a P frame, each value can range from -128 to +127, with -128 and +127 representing the maximum residual value possible or a lot of edge information, and zero representing no residual.
While the above-described conventional video encoder 12 and method are effective in compressing the amount of data to be transmitted, they require much computation and therefore increase the time and cost of the video encoder 12 and video decoder 22. In particular, motion estimation is the most computationally intensive part of the video encoding process, and often accounts for more than half of the processing. For this reason, many video encoding solutions prefer to perform motion estimation either by using dedicated hardware, or by some fast sub-optimal software scheme. Dedicated hardware can be realized as an ASIC (Application Specific Integrated Circuit) or as an FPGA (Field Programmable Gate Array). While dedicated hardware provides fast and accurate motion estimation, it can be very expensive. As a result, software schemes are often preferred because they are less expensive. These software schemes achieve fast motion estimation by doing a sub-optimal search using the inherent processor of a PC or workstation. Unfortunately, the motion estimation performed by these software schemes is generally less accurate.
Although motion estimation is the most computationally intensive part of the video encoding process, DCT, quantization, zig-zag scanning and variable length coding are also computationally intensive. Unfortunately, in the conventional video encoding and decoding method, all frames 10 must go through DCT, quantization, zig-zag scan and variable length coding.
Thus, there still remains a need for a video encoder and method which significantly reduces the computation performed by the video encoder and the video decoder without suffering any degradation in the perceived quality of the compressed video data.