The present invention relates in general to video compression and more particularly to a system and a method for encoding, transmitting, decoding and storing a high-resolution video sequence using a low-resolution base layer and a higher-resolution enhancement layer.
Bit rate reduction is vitally important for sending as much information as possible over a given communication or storage capacity. Bit rate is the amount of data transmitted in a given time. Reducing it matters because communication capacity is limited by regulatory, physical and commercial constraints and, as demand increases for higher-resolution television and video, it is crucial that maximum use be made of the limited capacity available on any given communication or storage medium.
One technique of managing bit rate is data compression. Data compression is storing data in a format that requires less space than would otherwise be used to store the information. Data compression is particularly useful in the transmission of information because it allows a large amount of information to be transmitted using a reduced number of bits. Lossless data compression, which is used mainly for compressing text information, programs, or other computer data, refers to data compression formats in which no data is lost. Greater compression can be achieved on graphics, audio and video data by using lossy compression, which refers to data compression formats in which some amount of representation fidelity is lost. Most video compression formats use a lossy compression technique. A compression method with a high degree of bit rate reduction for a given level of fidelity is said to have good compression efficiency.
The well-known International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.26x and Moving Picture Experts Group (MPEG) video coding standards are examples of a family of conventional video compression formats that use lossy compression. These coding techniques provide high compression ratios by representing some image frames as only the changes between frames rather than the entire frame. The changing information is then encoded using a technique called Motion-Compensated Discrete Cosine Transform (MC+DCT) coding. Motion compensation (MC) approximates each area of a video picture as a spatially-shifted area of a previously-decoded picture, and Discrete Cosine Transform (DCT) coding is a technique that represents waveform data as a weighted sum of cosine waveforms. In general, ITU-T and MPEG video compression remove temporal redundancy between video frames by means of motion compensation, remove spatial redundancy within a video frame by means of a Discrete Cosine Transform and quantization (an approximation by rounding) of the DCT samples, and remove statistical redundancy of quantized index values by means of statistical lossless entropy-reduction coding.
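The DCT and quantization steps described above can be illustrated with a minimal one-dimensional sketch. It assumes an orthonormal DCT-II and a single uniform quantizer step; the function names and the step size are illustrative choices and are not taken from any ITU-T or MPEG standard, which use two-dimensional block transforms and standard-specific quantization rules.

```python
import math

def dct_1d(samples):
    """Orthonormal DCT-II: the weights of the cosine waveforms that sum to `samples`."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        coeffs.append(scale * total)
    return coeffs

def idct_1d(coeffs):
    """Inverse of dct_1d: rebuild the samples as a weighted sum of cosines."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        samples.append(total)
    return samples

def quantize(coeffs, step):
    """Lossy step: round each coefficient to an integer index (assumed uniform step)."""
    return [round(c / step) for c in coeffs]

def dequantize(indices, step):
    """Map the integer indices back to approximate coefficient values."""
    return [q * step for q in indices]

row = [52, 55, 61, 66, 70, 61, 64, 73]   # one row of pixel values (made-up data)
coeffs = dct_1d(row)
indices = quantize(coeffs, step=4)        # small integers, passed on to entropy coding
approx = idct_1d(dequantize(indices, step=4))
print(indices)
print([round(v) for v in approx])
```

Without the quantization step the transform is exactly invertible (up to floating-point error); the fidelity loss in this sketch comes entirely from rounding the coefficients.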
More particularly, ITU-T and MPEG coding work by dividing each frame into rectangular (such as 16×16 pixel) macroblocks and first determining how each macroblock has moved between frames. A motion vector defines any motion of the macroblock that occurs between frames and is used to construct a predicted frame. A process called motion estimation takes place in the encoder to determine the best motion vector value for each macroblock. This predicted frame, which is a previously-decoded frame adjusted by the motion vectors, is compared to an actual input frame. Any information left over after this comparison is called the residual and is used to construct residual frames.
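Motion estimation and the residual can be sketched as follows. This is a minimal illustration that assumes an exhaustive (full) search with a sum-of-absolute-differences cost over integer-pixel displacements; real encoders typically use faster searches and sub-pixel precision, and the function names, the 8×8 test frames and the 4×4 block size are hypothetical.

```python
def sad(current, reference, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(current[cy + r][cx + c] - reference[ry + r][rx + c])
        for r in range(size)
        for c in range(size)
    )

def estimate_motion(current, reference, bx, by, size, search):
    """Full search: find the displacement (dx, dy) into the reference frame
    that best predicts the block whose top-left corner is (bx, by)."""
    height, width = len(reference), len(reference[0])
    best = (0, 0)
    best_cost = sad(current, reference, bx, by, bx, by, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(current, reference, bx, by, rx, ry, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

def residual_block(current, reference, bx, by, size, mv):
    """Difference between the actual block and its motion-compensated prediction."""
    dx, dy = mv
    return [
        [current[by + r][bx + c] - reference[by + dy + r][bx + dx + c]
         for c in range(size)]
        for r in range(size)
    ]

def make_frame(patch_x, patch_y, width=8, height=8):
    """Black frame with a small 2x2 patch of distinct values (made-up test data)."""
    frame = [[0] * width for _ in range(height)]
    frame[patch_y][patch_x] = 200
    frame[patch_y][patch_x + 1] = 150
    frame[patch_y + 1][patch_x] = 100
    frame[patch_y + 1][patch_x + 1] = 50
    return frame

reference = make_frame(2, 2)   # previously-decoded frame
current = make_frame(3, 2)     # same content moved one pixel to the right
mv, cost = estimate_motion(current, reference, bx=2, by=2, size=4, search=2)
print(mv, cost)                # (-1, 0) 0
print(residual_block(current, reference, 2, 2, 4, mv))
```

Because the patch simply shifted, the best vector points one pixel to the left in the reference frame and the residual is all zeros; in real video the prediction is imperfect and the nonzero residual is what gets transformed and quantized.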
There are generally three main types of coded pictures in such conventional video coding: (1) intra pictures (I-frames); (2) forward predicted pictures (P-frames); and (3) bi-directional predicted pictures (B-frames). I-frames are encoded as independent pictures with no reference to past or future frames. These frames contain full picture information and can be used to predict other frames. P-frames are encoded relative to the past frames, while B-frames are encoded relative to past frames, future frames or both. ITU-T and MPEG coding use these three types of frames and encoded motion vectors to represent video. This video representation is performed by using I-frames at the start of an independent sequence of pictures and then using P and B frames to encode the remaining pictures in the sequence.
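Because a B-frame may be predicted from a future frame, that future I-frame or P-frame must be available to the decoder before the B-frame can be reconstructed. The sketch below illustrates one common consequence, assuming the encoder simply emits each anchor (I or P) frame ahead of the B-frames that precede it in display order; the function name and the frame pattern are hypothetical, and actual standards define their own ordering rules.

```python
def coding_order(display_types):
    """Return frame indices in an order where every B-frame comes after
    the next anchor (I or P) frame that follows it in display order."""
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)     # wait for the next anchor frame
        else:
            order.append(index)         # anchor frame is coded first
            order.extend(pending_b)     # then the B-frames that depend on it
            pending_b = []
    order.extend(pending_b)             # trailing B-frames with no later anchor
    return order

pattern = "IBBPBBP"                     # display order (illustrative pattern)
print(coding_order(pattern))            # [0, 3, 1, 2, 6, 4, 5]
```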
One problem with ITU-T and MPEG coding is that the decoding of high-resolution video requires far greater computational complexity than what is required for lower-resolution video. This means that high-resolution decoders are significantly more expensive than those decoders used for lower resolution video. Delivery of high-resolution video also requires a much higher bit rate than does lower-resolution video. It is therefore highly desirable to provide support for delivery of the same video content as either low-resolution video or as high-resolution video.
One technique of video coding that encodes video using a low-resolution base layer and a higher-resolution enhancement layer is known as spatially-scalable video coding. Spatially-scalable video coding uses a base layer that is decodable as a conventional non-layered video representation at a lower bit rate than an enhancement layer used for the high-resolution video. This allows the base layer to serve lower-capacity receivers while enabling better service for higher-capacity receivers (that receive both the base and enhancement layers). The base layer may also be designed to conform to some prior standard encoding method, in order for the base layer to leverage receivers manufactured to popular and widely-used designs.
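The layering idea can be sketched in a few lines. This is a generic illustration of spatial scalability under simplifying assumptions, not the method of any particular standard: the base layer is a 2:1 downsampled picture obtained by averaging 2×2 blocks, the enhancement layer is the difference between the original picture and the pixel-replicated base picture, neither layer is actually compressed, and the picture dimensions are assumed to be even. All function names are hypothetical.

```python
def downsample_2x(image):
    """Average each 2x2 block to form the low-resolution base picture."""
    return [
        [(image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) // 4
         for x in range(0, len(image[0]), 2)]
        for y in range(0, len(image), 2)
    ]

def upsample_2x(image):
    """Pixel replication back to the high-resolution sampling grid."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def encode_layers(high_res):
    """Split a picture into a base layer and an enhancement (difference) layer."""
    base = downsample_2x(high_res)
    prediction = upsample_2x(base)
    enhancement = [
        [h - p for h, p in zip(h_row, p_row)]
        for h_row, p_row in zip(high_res, prediction)
    ]
    return base, enhancement

def decode_enhanced(base, enhancement):
    """A higher-capacity receiver adds the enhancement data to the upsampled base."""
    prediction = upsample_2x(base)
    return [
        [p + e for p, e in zip(p_row, e_row)]
        for p_row, e_row in zip(prediction, enhancement)
    ]

high_res = [
    [10, 12, 20, 22],
    [14, 16, 24, 26],
    [30, 32, 40, 42],
    [34, 36, 44, 46],
]
base, enhancement = encode_layers(high_res)
print(base)                                          # usable alone by a base-only receiver
print(decode_enhanced(base, enhancement) == high_res)
```

A lower-capacity receiver uses only `base`, while a higher-capacity receiver combines both layers to recover the full-resolution picture.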
One disadvantage, however, of spatially-scalable video coding is that there is a significant loss of compression efficiency for the high-resolution video representation relative to a separate encoding of the high resolution video using the same total bit rate but without the scalability layering structure.
There exists a need, therefore, for a system and a method of encoding, transmitting, decoding and storing a high-resolution video sequence that provides higher compression efficiency than current standard techniques while retaining the advantages of spatially-scalable layered video coding. Such a system and a method would have relevance for HDTV and beyond, and could potentially become a universal video protocol for such widespread use as the Internet, digital video disks (DVD) and new generations of home and commercial video recording devices.
To overcome the limitations in the prior art as described above and other limitations that will become apparent upon reading and understanding the present specification, the present invention is embodied in a system and a method for transmitting and storing high-resolution video using a low-resolution base layer and a higher-resolution enhancement layer. The present invention uses decoded low-resolution images and additional data from the low-resolution video representation to aid in the decoding of the higher-resolution video. In particular, a preferred embodiment uses motion vector data from the low-resolution video representation to aid in the decoding of the higher-resolution video. The present invention provides high fidelity, uses a minimal bit rate, and can be applied in a manner which allows the low-resolution video to remain backward-compatible with existing standard video compression technology (such as the ITU-T and MPEG standards).
In particular, the present invention is especially well-suited for transmitting and delivering encoded higher-resolution video so that it can be viewed simultaneously in low resolution by a base layer decoder and in enhanced high resolution by an enhancement layer decoder. The present invention divides and encodes a high-resolution video sequence into a lower-resolution base layer and a higher-resolution enhancement layer. The low-resolution base layer, although encoded using a special encoder, can remain, if desired, completely compatible with existing standard video compression formats (such as ITU-T or MPEG standards). If the present invention is designed in this compatible fashion, the low-resolution base layer can be correctly decoded using a base layer decoder without access to (or any awareness of) the enhancement layer data. Thus, a decoder for an existing standard (such as ITU-T or MPEG standards) may be leveraged by the present invention to decode the low-resolution base layer without any knowledge or use of the accompanying higher-resolution enhancement layer.
The higher-resolution enhancement layer of the present invention is decoded using an enhancement decoder that makes use of the decoded base layer images, the enhancement layer encoded data and additional data from the encoded base layer data. In particular, in a preferred embodiment, the encoded base layer data includes motion vector data that is used to decode the enhancement layer. A key element of the invention is that, in addition to the decoded images, base layer data (such as motion vectors) are used in decoding both the base layer and the enhancement layer. In other words, the decoding of the enhancement layer leverages data used in the decoding of the base layer.
The amount of enhancement is variable and can be selected by, for example, a viewer, a manufacturer or a cable service provider. Variable enhancement is important because different devices can subscribe to different levels of enhancement. For example, a 32-inch HDTV may subscribe to less enhancement than a 64-inch HDTV because the 32-inch HDTV requires less resolution. In addition, variable amounts of enhancement permit flexible pricing schemes whereby more expensive televisions and cable boxes provide greater resolution.
The system of the present invention includes a layered video encoder and a layered video decoder, and, alternatively, a base video decoder that does not support the layering feature. The layered video encoder receives a high-resolution video sequence and outputs a compressed video stream including a base layer and an enhancement layer. The base layer compressed stream, which is created by a base layer encoding module, is an independently-decodable low-resolution video stream. In some embodiments, the base layer video stream will conform to an existing standard video compression format and will be capable of being decoded by standard video decoders. The base layer contains data (such as motion vectors) that are used in decoding both the base layer and the enhancement layer. The enhancement layer, which is created by an enhancement layer encoding module, is a higher-resolution video stream that contains higher-resolution video information and provides high-resolution enhancement to the base layer. The layered video decoder includes a base layer decoder module, for decoding the base layer, and an enhancement layer decoder module, for decoding the enhancement layer.
The present invention also includes a method for encoding a high-resolution video sequence to produce a base layer and an enhancement layer. Specifically, the present invention processes the high-resolution video sequence to create a low-resolution base layer that is independently decodable. Further, the present invention then produces an enhancement layer which provides higher-resolution enhancement to the base layer and permits the amount of enhancement to be continuously varied. In addition, the ratio of the bit rate dedicated to the transmission of the base layer versus the enhancement layer is variable and can be adjusted to suit the needs of the specific application.
The present invention also includes a method for leveraging transmitted motion vector information for use in the decoding of both the base and enhancement layers. Specifically, the present invention calculates motion vectors in a high-resolution video sequence, scales them down for use with lower-resolution video, transmits the motion vectors in a base layer and then scales them up during decoding. Although they are transmitted in the base layer, these motion vectors are used in the decoding of both the base layer and the enhancement layer. The present invention may be embodied in a computer-readable medium having several computer-executable modules for performing the functions described above.
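The scale-down and scale-up steps for motion vectors can be sketched as follows. This is only an illustration under stated assumptions: a 2:1 resolution ratio between the layers, base-layer vectors stored with half-pixel precision, and rounding to the nearest representable value. The specification text above does not fix any of these details, and the function names are hypothetical.

```python
import math
from fractions import Fraction

def _snap(value, step):
    """Round to the nearest multiple of `step` (ties rounded upward)."""
    return math.floor(value / step + Fraction(1, 2)) * step

def scale_down(mv, factor, precision=Fraction(1, 2)):
    """Convert a high-resolution motion vector (in pixels) into a base-layer
    vector, snapped to the assumed base-layer precision."""
    return tuple(_snap(Fraction(component) / factor, precision) for component in mv)

def scale_up(mv, factor):
    """Convert a transmitted base-layer vector back to high-resolution pixels."""
    return tuple(component * factor for component in mv)

high_res_mv = (6, -3)                       # estimated on the high-resolution video
base_mv = scale_down(high_res_mv, 2)        # transmitted in the base layer
print(base_mv)                              # (Fraction(3, 1), Fraction(-3, 2))
print(scale_up(base_mv, 2) == high_res_mv)  # True
```

Under these assumptions, integer-pixel high-resolution vectors survive the round trip exactly, because halving them lands on the half-pixel grid of the base layer. A finer high-resolution vector such as 5/2 pixels would be rounded during scale-down and would come back as 3 pixels, so the enhancement layer decoder would be working with an approximation of the original motion.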