1. Field of the Invention
The present invention relates generally to video coding, and in particular, to a video encoder having an integrated scaling mechanism.
2. Background Art
Generally, there are two types of compression: (1) moving picture compression (which relies on motion estimation), and (2) still image compression. Some commonly used standards for moving picture compression include MPEG-I, MPEG-II, H.261, etc. The dominant standard for still image compression is the JPEG standard. Both moving picture compression and still image compression utilize discrete cosine transform (DCT) and variable length encoding (VLE) to compress data (i.e., to eliminate the spatially redundant data), and the corresponding inverse operations to de-compress the data.
Moving picture compression additionally utilizes temporal data to compress the data further. Specifically, moving picture compression employs motion estimation techniques that refer to data in previous and/or future frames of an image sequence; frames coded in this way are known as P (predictive) and B (bidirectional) frames. The basic scheme is to predict motion from frame to frame in the temporal direction, and then to use DCTs to reduce the redundancy in the spatial directions.
In contrast, still image compression utilizes the current frame without considering a previous frame. In other words, still image compression only utilizes an I-frame (intra-frame) without referring to the B and P frames. Additional information regarding the JPEG Still Picture Compression Standard is provided in a paper entitled "The JPEG Still Picture Compression Standard", by Gregory Wallace of Digital Equipment Corp., submitted in December 1991 for publication in IEEE Transactions on Consumer Electronics (a copy is attached herewith).
Up-scaling of an image is important in still image compression and not in moving picture compression because data related to moving picture compression in a DCT buffer is a residue of the motion estimation data and cannot be scaled up.
FIG. 1 illustrates the general structural blocks that are used for, and the steps involved in, the conventional digital coding of a sequence of video images. In particular, the video image is made up of a sequence of video frames 10 that are captured, such as by a digital camera, and transmitted to a video encoder 12. The video encoder 12 receives the digital data on a frame-by-frame and macroblock-by-macroblock basis, and applies a video encoding algorithm to compress the video data. In some applications, the video encoding algorithm can also be implemented in hardware. The video encoder 12 generates an output which consists of a binary bit stream 14 that is processed by a modulator 16. The modulator 16 modulates the binary bit stream 14 and provides the appropriate error protection. The modulated binary bit stream 14 is then transmitted over an appropriate transmission channel 18, such as through a wireless connection (e.g., radio frequency), a wired connection, or via the Internet. The transmission can be done in an analog format (e.g., over phone lines or via satellite) or in a digital format (e.g., via ISDN or cable). The transmitted binary bit stream 14 is then demodulated by a demodulator 20 and provided to a video decoder 22. The video decoder 22 takes the demodulated binary bit stream 24 and converts or decodes it into sequential video frames. These video frames are then provided to a display 26, such as a television screen or monitor, where they can be viewed. If the transmission channel 18 utilizes an analog format, a digital-to-analog converter is provided at the modulator 16 to convert the digital video data to analog form for transmission, and an analog-to-digital converter is provided at the demodulator 20 to convert the analog signals back into digital form for decoding and display.
The video encoding can be embodied in a variety of ways. For example, the actual scene or image can be captured by a camera and provided to a chipset for video encoding. This chipset could take the form of an add-on card that is added to a personal computer (PC). As another example, the camera can include an on-board chip that performs the video encoding. This on-board chip could take the form of an add-on card that is added to a PC, or of a separate stand-alone video phone. As yet another example, the camera could be provided on a PC and the images provided directly to the processor on the PC, which performs the video encoding.
Similarly, the video decoder 22 can be embodied in the form of a chip that is incorporated either into a PC or into a video box that is connected to a display unit, such as a monitor or television set.
Each digital video frame 10 is made up of x columns and y rows of pixels (also known as "pels"). In a typical frame 10 (see FIG. 2), there could be 720 columns and 640 rows of pels. Since each pel contains 8 bits of data (for luminance data), each frame 10 could have over three million bits of data (for luminance data). If we include chrominance data, each pel has up to 24 bits of data, so that this number is even greater. This large quantity of data is unsuitable for data storage or transmission because most applications have limited storage (i.e., memory) or limited channel bandwidth. To respond to the large quantity of data that has to be stored or transmitted, techniques have been provided for compressing the data from one frame 10 or a sequence of frames 10 to provide an output that contains a minimal amount of data. This process of compressing large amounts of data from successive video frames is called video compression, and is performed in the video encoder 12.
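The frame-size arithmetic above can be checked with a short sketch. The function name `frame_bits` is hypothetical and is used only for illustration; the dimensions and bit depths are the ones given in the text.

```python
def frame_bits(columns, rows, bits_per_pel):
    """Return the number of bits in one uncompressed frame."""
    return columns * rows * bits_per_pel

# Luminance only: 8 bits per pel.
luma_bits = frame_bits(720, 640, 8)
# Luminance plus chrominance: up to 24 bits per pel.
full_bits = frame_bits(720, 640, 24)

print(luma_bits)  # 3686400
print(full_bits)  # 11059200
```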
During conventional video encoding, the video encoder 12 will take each frame 10 and divide it into blocks. In particular, each frame 10 can be first divided into macroblocks MB, as shown in FIG. 2. Each of these macroblocks MB can have, for example, 16 rows and 16 columns of pels. Each macroblock MB can be further divided into four blocks B, each block having 8 rows and 8 columns of pels. Once each frame 10 has been divided into blocks B, the video encoder 12 is ready to compress the data in the frame 10.
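The division of a frame into macroblocks MB and blocks B may be sketched as follows. This is a minimal illustration only: the helper `split_into_blocks` is a hypothetical name, the frame is modeled as a plain list of rows, and both frame dimensions are assumed to be multiples of the tile size.

```python
def split_into_blocks(pels, size):
    """Split a 2-D list of pels into size x size tiles, in raster order.

    Assumes both dimensions of `pels` are multiples of `size`.
    """
    rows, cols = len(pels), len(pels[0])
    return [
        [row[left:left + size] for row in pels[top:top + size]]
        for top in range(0, rows, size)
        for left in range(0, cols, size)
    ]

# A 640-row by 720-column luminance frame filled with a dummy value.
frame = [[0] * 720 for _ in range(640)]
macroblocks = split_into_blocks(frame, 16)     # 16x16 macroblocks MB
blocks = split_into_blocks(macroblocks[0], 8)  # four 8x8 blocks B per MB

print(len(macroblocks))  # 1800
print(len(blocks))       # 4
```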
FIG. 3 illustrates the different steps, and the possible hardware components, that are used by the conventional video encoder 12 to carry out the video compression. Since each frame 10 contains a plurality of blocks B, the following steps will process each frame 10 on a block-by-block basis.
Each block B from each frame 10 is provided to a memory 42 that is provided to store the unscaled image. A separate upscaler circuit 44 reads the unscaled image from memory 42, scales the image, and writes the scaled image back to memory 42. As will be described later, a DCT block 60 reads the scaled image for further processing. FIG. 5, which is described further hereinafter, describes more fully the interaction of memory 42, upscaler 44, and DCT 60.
Each block B from each frame 10 is also provided to a QP decision engine 50 which determines a QP or quantization step size number for the block or groups of blocks. This QP number is determined by a rate control mechanism which divides a fixed bit budget of a frame among different blocks, and is used by the quantization engine 80 to carry out quantization as described below.
Each block B is now provided to a DCT engine 60. DCT of individual blocks helps in removing the spatial redundancy by concentrating the most relevant information in the lowest-frequency coefficients in the DCT domain. DCT can be accomplished by carrying out a Fourier-like transformation of the values in each block B. DCT produces a transformed block 70 in which the low-frequency coefficients, which typically carry most of the energy, are placed in the top left corner 72 of the transformed block 70, and the higher frequency values, which are typically zero or small, are placed in the bottom right corner 74.
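By way of illustration, a two-dimensional DCT of a block can be sketched directly from the textbook DCT-II formula. The orthonormal scaling used here is an assumption (the text does not specify a normalization), and `dct_2d` is a hypothetical name. A real DCT engine would use a fast algorithm rather than this direct summation.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Normalization factor; assumed orthonormal scaling.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    def basis(i, k):
        # Cosine basis function of frequency k sampled at position i.
        return math.cos((2 * i + 1) * k * math.pi / (2 * n))

    return [
        [
            scale(u) * scale(v) * sum(
                block[x][y] * basis(x, u) * basis(y, v)
                for x in range(n) for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]

# A perfectly flat 8x8 block has no spatial detail at all.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # 800.0
```

For the flat block, all of the energy lands in the single top left coefficient and every other coefficient is zero to within rounding error, which is the redundancy that the later quantization and scan steps exploit.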
After having obtained a block 70 of DCT coefficients which contain the energy of the displaced blocks, quantization of these blocks 70 is performed by quantization engine 80. Quantization is a uniform quantization with a step size (i.e., the predetermined QP) varying within a certain range, such as from 2 to 62. It is implemented as a division, or as a table look-up operation for a fixed-point implementation, of each value in the transformed block 70. For example, the quantization level for each value in the block 70 can be determined by dividing the value by 2×QP (i.e., twice the QP). Therefore, if QP is 10 and a value in the block is 100, then the quantization level for this value is equal to 100 divided by 2×QP, or 5. At the video decoder 22 in FIG. 1, the value is reconstructed by multiplying the quantization level (i.e., 5) by 2×QP to obtain the original value of 100. Thus, quantization takes a finite set of values and maps it onto a smaller set of quantization levels, providing a quantized block 90 where the top left corner 92 contains higher quantized levels, and the bottom right corner 94 contains mostly zeros.
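The worked example above (QP of 10, value of 100) can be reproduced with the following sketch. The function names are hypothetical, and truncation toward zero is an assumption, since the text does not state how fractional quotients are rounded.

```python
def quantize(value, qp):
    """Map a DCT coefficient to a quantization level by dividing by 2*QP.

    Truncation toward zero is an assumption; the text does not state
    how fractional results are rounded.
    """
    return int(value / (2 * qp))

def dequantize(level, qp):
    """Reconstruct a coefficient at the decoder by multiplying by 2*QP."""
    return level * 2 * qp

level = quantize(100, 10)
print(level)                              # 5
print(dequantize(level, 10))              # 100
print(dequantize(quantize(110, 10), 10))  # 100
```

The last line shows that a value that is not a multiple of 2×QP is not recovered exactly, which is why quantization is the lossy step of the encoder.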
Next, the quantized block 90 is provided to a zig-zag scan engine 100 which performs a zig-zag scan of the values in the block 90. The direction of the scan is illustrated in FIG. 4, and begins from the top left corner 92, which contains the higher quantized levels, through the middle of the block 90 and to the bottom right corner 94, which contains mostly zeros. The zig-zag scan produces a zig-zag scan block 110 in which the quantized values from the quantized block 90 are positioned linearly across the zig-zag scan block 110. Therefore, zig-zag scan emulates going from a lower to a higher frequency, thereby resulting in long runs of zeros in the zig-zag scan block 110.
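A zig-zag scan of this kind may be sketched as follows. FIG. 4 is not reproduced here, so the traversal below follows the conventional JPEG-style zig-zag pattern as an assumption; the function names are hypothetical.

```python
def zigzag_order(n):
    """Return (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):
        # All positions on the anti-diagonal where row + col == s,
        # listed from top-right to bottom-left.
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are walked from bottom-left to top-right.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order

def zigzag_scan(block):
    """Flatten a square block into a list following the zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

quantized = [
    [9, 4, 1, 0],
    [3, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(zigzag_scan(quantized))
# [9, 4, 3, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

In this small example the non-zero levels from the top left corner are gathered at the start of the list, leaving a single long run of zeros at the end.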
The values in the zig-zag scan block 110 are then provided to a variable length coding engine 120 where entropy coding is performed. Traditionally, most video coding standards use Huffman coding for entropy coding. The JPEG standard can use either Huffman coding or arithmetic coding. First, a non-zero value followed by runs of zeros is encoded as a single "event". For example, "400000000000" and "10000000000000" would each be encoded as separate single events. Entropy coding is then performed on these events to generate a unique binary code for each event. These binary codes are output as the binary bitstream 14 described above. These unique binary codes can be recognized by the video decoder 22 and decoded by the video decoder 22 into the original values (i.e., non-zero values followed by runs of zeros).
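The grouping of scanned values into events can be sketched as below. This illustrates only the grouping step, with each event represented as a (value, number of trailing zeros) pair; that representation and the name `to_events` are assumptions, and the subsequent Huffman or arithmetic coding of the events is not shown.

```python
def to_events(values):
    """Group a scanned sequence into (value, trailing_zero_count) events.

    Each non-zero value starts a new event that absorbs the zeros that
    follow it. Leading zeros, if any, form an event with value 0.
    """
    events = []
    for v in values:
        if v != 0 or not events:
            events.append([v, 0])
        else:
            events[-1][1] += 1
    return [tuple(e) for e in events]

# The two example strings from the text, concatenated digit by digit.
first, second = "400000000000", "10000000000000"
scanned = [int(ch) for ch in first + second]
events = to_events(scanned)
print(len(events))  # 2
print(events)
```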
Thus, the conventional video encoder 12 and its operation, as illustrated in FIG. 3, function to minimize (i.e., compress) the large number of bits at the input blocks B of each frame 10 (see FIG. 2) to a minimal number of bits at the bitstream 14, taking advantage of the fact that the DCT and quantization steps will produce multiple runs of zeros. The transmitted bitstream 14 is decoded by the video decoder 22 by reversing the steps performed by the video encoder 12.
Up-scaling is an important and needed operation since the format of an image captured by an image capture device in many cases is different from the format expected by a compression scheme utilized by the video compressor. For example, an image format commonly utilized by input devices, such as charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) based video cameras and video-cassette recorders (VCR), is the National Television System Committee (NTSC) format. A video frame in the NTSC format can be in an interlace display mode or a progressive (non-interlace) display mode. The frame can have a size of 720×480. In an interlace display mode, an even field having the even lines of the frame and an odd field having the odd lines of the frame are provided. The even field and the odd field can each have a size of 720×240.
In contrast, a common intermediate format (CIF) compression scheme expects an image to have a format with a size of 352×288. Accordingly, in an interlace mode, the odd field and even field both need to be up-scaled (in the y-direction) so that the height of the field is increased from 240 to 288. In other instances, depending on the format of the image as captured by the image capture device and the format expected by a particular compression algorithm, the input image may need to be up-scaled in the x-direction (i.e., the width of the image may need to be increased).
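Up-scaling a field in the y-direction from 240 to 288 rows may be sketched as follows. The text does not name an interpolation method, so linear interpolation between neighboring rows is used here purely as an assumption, and `upscale_rows` is a hypothetical name.

```python
def upscale_rows(field, out_rows):
    """Vertically resample a field (list of rows) to out_rows rows.

    Uses linear interpolation between the two nearest source rows;
    the choice of filter is an assumption for illustration.
    """
    in_rows = len(field)
    result = []
    for j in range(out_rows):
        # Map output row j onto a fractional source row position.
        pos = j * (in_rows - 1) / (out_rows - 1) if out_rows > 1 else 0.0
        top = int(pos)
        bottom = min(top + 1, in_rows - 1)
        frac = pos - top
        result.append([(1 - frac) * a + frac * b
                       for a, b in zip(field[top], field[bottom])])
    return result

# A 240-row test field, 4 pels wide, in which every pel holds its row index.
field = [[float(r)] * 4 for r in range(240)]
scaled = upscale_rows(field, 288)
print(len(scaled))  # 288
```

Because every pel in the test field holds its own row index, the first and last output rows should match the first and last source rows, with intermediate rows falling smoothly between them.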
FIG. 5 illustrates a conventional approach to format the size of a captured image to a format suitable for a particular compression scheme. In step 150, an image is captured via an input device, such as a charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) based video camera. In step 154, a cropping or down-scaling operation is performed on the captured image. In step 158, the down-scaled image is written into a memory, such as a dynamic random access memory (DRAM). In step 164, a video accelerator reads the down-scaled image from the memory. In step 168, the video accelerator performs an up-scaling operation (e.g., adjusting the height of the image so that it meets the requirements of the compression scheme) on the down-scaled image. In step 174, the up-scaled image is written into the memory. In step 178, a DCT module reads the up-scaled image from the memory.
As is evident from FIG. 5, the conventional approach requires that (1) additional memory be reserved to store the intermediate images; and (2) an access time be allotted to read and write the intermediate images from and to memory.
These additional memory accesses decrease the overall speed of the image processing system. In addition, these additional memory accesses reduce the available bandwidth of the memory bus, and increase the space in memory that needs to be allocated for the intermediate results.
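The cost of the intermediate memory accesses in FIG. 5 can be estimated with a rough sketch. The figures used here are assumptions chosen for illustration: a down-scaled field of 352×240 pels, an up-scaled image of 352×288 pels, and one byte per pel (luminance only). The function name is hypothetical.

```python
def conventional_traffic(width, in_rows, out_rows, bytes_per_pel=1):
    """Bytes moved over the memory bus in steps 158 through 178 of FIG. 5."""
    small = width * in_rows * bytes_per_pel   # down-scaled image
    large = width * out_rows * bytes_per_pel  # up-scaled image
    return {
        "step 158 write down-scaled": small,
        "step 164 read down-scaled": small,
        "step 174 write up-scaled": large,
        "step 178 read up-scaled": large,
    }

traffic = conventional_traffic(width=352, in_rows=240, out_rows=288)
total = sum(traffic.values())
# Traffic caused solely by the separate up-scaling pass.
upscaler_round_trip = (traffic["step 164 read down-scaled"]
                       + traffic["step 174 write up-scaled"])

print(total)                # 371712
print(upscaler_round_trip)  # 185856
```

Under these assumptions, the read and write performed by the separate up-scaling pass account for half of the bytes moved per field, in addition to the memory reserved for the up-scaled intermediate image.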
Thus, there still remains a need for a video encoder and up-scaler that reduces the number of memory accesses and increases the available space in memory and the available memory bus bandwidth.