1. Field of the Invention
The present invention is directed to a system and method for enhancing the quality of a digital video signal using coding information and local spatial features. The system and method of the invention enhance the sharpness of encoded or transcoded digital video without enhancing encoding artifacts.
2. Description of the Related Art
The development of high-quality multimedia devices, such as set-top boxes, high-end TVs, digital TVs, personal TVs, storage products, PDAs, and wireless Internet devices, is leading to a variety of architectures and to more openness towards new features for these devices. Moreover, the development of these new products, and their ability to display video data in any format, has resulted in new requirements and opportunities with respect to video processing and video enhancement algorithms.
MPEG (Moving Picture Experts Group) video compression is used in many current and emerging products. MPEG is at the heart of digital television set-top boxes, DSS, HDTV decoders, DVD players, video conferencing, Internet video, and other applications. These applications benefit from video compression by requiring less storage space for archived video information, less bandwidth for the transmission of the video information from one point to another, or a combination of both. Most of these devices receive and/or store video in the MPEG-2 format. In the future, they may receive and/or store video in the MPEG-4 format. The picture quality of these MPEG sources can vary greatly.
Research into the human visual system has shown that the eye is more sensitive to changes in luminance and less sensitive to variations in chrominance. MPEG operates on a color space that takes advantage of this difference in sensitivity. Thus, MPEG uses a YCbCr color space instead of RGB to represent the data values, where Y is the luminance component, experimentally determined to be Y=0.299R+0.587G+0.114B; Cb is the blue color difference component, Cb=B−Y; and Cr is the red color difference component, Cr=R−Y.
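The color space relations above can be illustrated with a minimal sketch. The function name `rgb_to_ycbcr` is hypothetical, and the code applies the unscaled definitions exactly as stated in the text; practical encoders additionally scale and offset the color difference signals, which is omitted here.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB sample to (Y, Cb, Cr) using the unscaled
    definitions given in the text (illustrative only)."""
    # Luminance: weighted sum reflecting the eye's sensitivity per primary.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Color difference components, without the scaling and offset
    # that a practical encoder would apply.
    cb = b - y
    cr = r - y
    return y, cb, cr


if __name__ == "__main__":
    # A neutral gray sample carries no color difference information.
    print(rgb_to_ycbcr(128, 128, 128))
    # A saturated red sample has a large positive Cr and a negative Cb.
    print(rgb_to_ycbcr(255, 0, 0))
```

For a neutral gray input, both color difference components are essentially zero, which is why the chrominance channels can be represented with less data than the luminance channel.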
MPEG video is arranged into a hierarchy of layers to help with error handling, random search and editing, and synchronization, for example with an audio bit-stream. The first layer, or top layer, is known as the video sequence layer, and is any self-contained bitstream, for example a coded movie, advertisement or a cartoon.
The second layer, below the first layer, is the group of pictures (GOP), which is composed of one or more groups of intra (I) frames and/or non-intra (P or B) pictures. I frames are strictly intra compressed, providing random access points to the video. P frames are motion-compensated forward-predictive-coded frames, which are inter-frame compressed, and typically provide more compression than I frames. B frames are motion-compensated bidirectionally-predictive-coded frames, which are inter-frame compressed, and typically provide the most compression.
The third layer, below the second layer, is the picture layer itself. The fourth layer, beneath the third layer, is called the slice layer. Each slice is a contiguous sequence of raster-ordered macroblocks, most often on a row basis in typical video applications. The slice structure is intended to allow decoding in the presence of errors. Each slice consists of macroblocks, which are 16×16 arrays of luminance pixels, or picture data elements, with two 8×8 arrays (depending on format) of associated chrominance pixels. The macroblocks can be further divided into distinct 8×8 blocks for further processing such as transform coding. A macroblock can be represented in several different manners when referring to the YCbCr color space. The three formats commonly used are known as 4:4:4, 4:2:2 and 4:2:0 video. 4:2:2 contains half as much chrominance information as 4:4:4, which is full-bandwidth YCbCr video, and 4:2:0 contains one quarter of the chrominance information. Because of this efficient representation of luminance and chrominance, the 4:2:0 format allows an immediate data reduction from the 12 blocks/macroblock of 4:4:4 to 6 blocks/macroblock.
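The block counts mentioned above follow from simple arithmetic, sketched below. The names `CHROMA_BLOCKS_PER_COMPONENT` and `blocks_per_macroblock` are hypothetical and chosen only for illustration.

```python
# Number of 8x8 chrominance blocks carried per color difference component
# (Cb or Cr) in one 16x16 macroblock, for each chroma format.
CHROMA_BLOCKS_PER_COMPONENT = {
    "4:4:4": 4,  # full-resolution chrominance
    "4:2:2": 2,  # half the chrominance information of 4:4:4
    "4:2:0": 1,  # one quarter of the chrominance information of 4:4:4
}


def blocks_per_macroblock(chroma_format):
    """Return the total number of 8x8 blocks in one macroblock."""
    # A 16x16 luminance array always splits into four 8x8 blocks.
    luma_blocks = (16 // 8) * (16 // 8)
    # Two color difference components: Cb and Cr.
    chroma_blocks = 2 * CHROMA_BLOCKS_PER_COMPONENT[chroma_format]
    return luma_blocks + chroma_blocks


if __name__ == "__main__":
    for fmt in ("4:4:4", "4:2:2", "4:2:0"):
        print(fmt, blocks_per_macroblock(fmt))  # 12, 8 and 6 respectively
```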
I frames provide only moderate compression as compared to the P and B frames, where MPEG derives its maximum compression efficiency. The efficiency is achieved through a technique called motion compensation based prediction, which exploits temporal redundancy. Since frames are closely related, it is assumed that a current picture can be modeled as a translation of the picture at the previous time. It is then possible to accurately predict the data of one frame based on the data of a previous frame. In P frames, each 16×16 macroblock is predicted from the macroblocks of a previously encoded I or P picture. Since frames are snapshots in time of a moving object, the macroblocks in the two frames may not correspond to the same spatial location. The encoder searches the previous frame (for P frames, or the frames before and after for B frames) in half-pixel increments for other macroblock locations that are a close match to the information contained in the current macroblock. The displacements in the horizontal and vertical directions of the best-match macroblock from the co-sited macroblock are called motion vectors. Both the motion vector and the difference between the current block and the matching block are encoded. The motion vectors can also be used for motion prediction in case of corrupted data, and sophisticated decoder algorithms can use these vectors for error concealment. For B frames, motion compensation based prediction and interpolation is performed using reference frames present on either side of each frame.
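The search for a best-match block described above can be illustrated with a simplified sketch. It is not the search used by any particular encoder: it assumes an exhaustive search at integer-pixel (rather than half-pixel) precision, uses the sum of absolute differences as the matching criterion, and represents frames as lists of rows of luminance values. The function names and the `search` range parameter are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (cx, cy) and the block of `ref` at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def find_motion_vector(cur, ref, cx, cy, size=16, search=4):
    """Exhaustively search `ref` around the co-sited position (cx, cy) and
    return ((mvx, mvy), cost) for the best-matching block."""
    height, width = len(ref), len(ref[0])
    # Start from the co-sited block, i.e. a zero motion vector.
    best_mv, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy, size)
    for mvy in range(-search, search + 1):
        for mvx in range(-search, search + 1):
            rx, ry = cx + mvx, cy + mvy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost


def residual_block(cur, ref, cx, cy, mv, size=16):
    """Difference between the current block and its motion-compensated
    prediction; this is what is coded together with the motion vector."""
    mvx, mvy = mv
    return [[cur[cy + dy][cx + dx] - ref[cy + mvy + dy][cx + mvx + dx]
             for dx in range(size)] for dy in range(size)]
```

When the current frame is an exact translation of the reference frame, the residual for the best match is all zeros, so only the motion vector carries information. This is the source of the compression gain of P and B frames over I frames.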
Next-generation storage devices, such as the blue-laser-based Digital Video Recorder (DVR), will have, to some extent, HD (High Definition) (ATSC) capability and are an example of the type of device for which a new method of picture enhancement would be advantageous. An HD program is typically broadcast at 20 Mb/s and encoded according to the MPEG-2 video standard. Taking into account the approximately 25 GB storage capacity of the DVR, this represents about a two-hour recording time of HD video per disc. To increase the record time, several long-play modes can be defined, such as Long-Play (LP) and Extended-Long-Play (ELP) modes.
For the LP mode, the average storage bitrate is assumed to be approximately 10 Mb/s, which doubles the record time for HD. As a consequence, transcoding, which reduces the broadcast bitrate of 20 Mb/s to the storage bitrate of 10 Mb/s, is an integral part of the video processing chain. During the MPEG-2 transcoding, the picture quality (e.g., sharpness) of the video is most likely reduced. However, especially for the LP mode, the picture quality should not be compromised too much. Therefore, for the LP mode, post-processing plays an important role in improving the perceived picture quality.
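The relation between bitrate and record time can be checked with a short calculation. This sketch assumes the disc capacity is 25 gigabytes (decimal) and ignores audio, multiplexing and file-system overhead, so it gives an upper bound rather than an actual record time. The function name `recording_hours` is hypothetical.

```python
def recording_hours(capacity_gigabytes, bitrate_mbps):
    """Upper bound on record time in hours for a given disc capacity
    (decimal gigabytes) and average video bitrate (megabits per second)."""
    capacity_bits = capacity_gigabytes * 1e9 * 8
    seconds = capacity_bits / (bitrate_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    broadcast = recording_hours(25, 20)  # broadcast bitrate
    long_play = recording_hours(25, 10)  # assumed LP-mode storage bitrate
    print(f"20 Mb/s: {broadcast:.2f} h, 10 Mb/s: {long_play:.2f} h")
```

Halving the bitrate from 20 Mb/s to 10 Mb/s doubles the bound, which is consistent with the doubled record time stated for the LP mode.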
To date, most state-of-the-art sharpness enhancement algorithms have been developed and optimized for analog video transmission standards such as NTSC (National Television System Committee), PAL (Phase Alternating Line) and SECAM (Séquentiel Couleur à Mémoire). Traditionally, image enhancement algorithms either reduce certain unwanted aspects of a picture (e.g., noise reduction) or improve certain desired characteristics of an image (e.g., sharpness enhancement). For these emerging storage devices, the traditional sharpness enhancement algorithms may perform sub-optimally on MPEG encoded or transcoded video because of the different characteristics of these sources. In the closed video processing chain of the storage system, information that allows the quality of the encoded source to be determined can be derived from the MPEG stream. This information can potentially be used to increase the performance of video enhancement algorithms.
Because picture quality will remain a distinguishing factor for high-end video products, new approaches for performing video enhancement, specifically adapted for use with these sources, will be beneficial. In C-J Tsai, P. Karunaratne, N. P. Galatsanos and A. K. Katsaggelos, “A Compressed Video Enhancement Algorithm”, Proc. of IEEE, ICIP'99, Kobe, Japan, Oct. 25-28, 1999, the authors propose an iterative algorithm for enhancing video sequences that are encoded at low bit rates. For MPEG sources, the degradation of the picture quality originates mostly from the quantization function. Thus, the iterative gradient-projection algorithm employed by the authors uses coding information such as quantization step size, macroblock types and forward motion vectors in its cost function. The algorithm shows promising results for low bit rate video; however, the method is marked by high computational complexity.
In B. Martins and S. Forchhammer, “Improved Decoding of MPEG-2 Coded Video”, Proc. of IBC'2000, Amsterdam, The Netherlands, Sep. 7-12, 2000, pp. 109-115, the authors describe a new concept for improving the decoding of MPEG-2 coded video. Specifically, a unified approach for deinterlacing and format conversion, integrated in the decoding process, is proposed. The technique results in considerably higher picture quality than that obtained by ordinary decoding. However, to date, its computational complexity prevents its implementation in consumer applications.
Both papers describe video enhancement algorithms using MPEG coding information and a cost function. However, both of these scenarios, in addition to being impractical, combine the enhancement and the cost function. A cost function determines how much, and at which locations in a picture, enhancement can be applied. The problem which results from this combination of cost and enhancement functions is that only one algorithm can be used with the cost function.
Moreover, previous attempts to improve sharpness enhancement algorithms utilized only the coding information from the MPEG bitstream. The previous sharpness enhancement algorithms did not differentiate between different picture types, such as I, P and B frames. Consequently, picture parts with coding artifacts were not differentiated from the artifact-free parts, and the resulting sharpness enhancement may be sub-optimal.