Digital video coding technology enables the efficient storage and transmission of the vast amounts of visual data that compose a digital video sequence. With the development of international digital video coding standards, digital video has now become commonplace in a host of applications, ranging from video conferencing and DVDs to digital TV, mobile video, and Internet video streaming and sharing. Digital video coding standards provide the interoperability and flexibility needed to fuel the growth of digital video applications worldwide.
There are two international organizations currently responsible for developing and maintaining digital video coding standards: the Video Coding Experts Group (“VCEG”), under the authority of the International Telecommunication Union-Telecommunication Standardization Sector (“ITU-T”), and the Moving Picture Experts Group (“MPEG”), under the authority of the International Organization for Standardization (“ISO”) and the International Electrotechnical Commission (“IEC”). The ITU-T has developed the H.26x family of video coding standards (e.g., H.261, H.263), and the ISO/IEC has developed the MPEG-x family (e.g., MPEG-1, MPEG-4). The H.26x standards have been designed mostly for real-time video communication applications, such as video conferencing and video telephony, while the MPEG standards have been designed to address the needs of video storage, video broadcasting, and video streaming applications.
The ITU-T and the ISO/IEC have also joined efforts in developing high-performance, high-quality video coding standards, including the earlier H.262 (or MPEG-2) standard and the more recent H.264 (or MPEG-4 Part 10/AVC) standard. The H.264 video coding standard, adopted in 2003, provides high video quality at substantially lower bit rates (up to 50% lower) than previous video coding standards. The standard is flexible enough to serve a wide variety of applications, from low to high bit rates and from low to high resolutions, including video telephony, video gaming, video surveillance, and many others, and it allows other advanced multimedia applications to be easily deployed over existing and future networks.
The H.264 video coding standard has a number of advantages that distinguish it from other existing video coding standards, while sharing common features with those standards. The basic video coding structure of H.264 is illustrated in FIG. 1. H.264 video coder 100 divides each video frame of a digital video sequence into 16×16 blocks of pixels (referred to as “macroblocks”) so that processing of a frame may be performed at a block level.
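By way of illustration, the partitioning of a frame into 16×16 macroblocks may be sketched in Python as follows. This is a minimal sketch: the function name and frame dimensions are illustrative, and a real encoder must also handle frames whose dimensions are not multiples of 16, typically by padding.

```python
def split_into_macroblocks(frame, mb_size=16):
    """Split a frame (a list of rows of luma samples) into
    mb_size x mb_size blocks, scanned left to right, top to bottom."""
    height, width = len(frame), len(frame[0])
    # For simplicity, assume the dimensions are multiples of mb_size.
    assert height % mb_size == 0 and width % mb_size == 0
    blocks = []
    for y in range(0, height, mb_size):
        for x in range(0, width, mb_size):
            blocks.append([row[x:x + mb_size] for row in frame[y:y + mb_size]])
    return blocks

# A 64x48 frame yields a 4x3 grid of macroblocks, i.e., 12 blocks.
frame = [[0] * 64 for _ in range(48)]
macroblocks = split_into_macroblocks(frame)
```

Each resulting block can then be processed independently, which is what allows coding decisions to be made at the macroblock level.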
Each macroblock may be coded as an intra-coded macroblock by using information from its current video frame or as an inter-coded macroblock by using information from its previous frames. Intra-coded macroblocks are coded to exploit the spatial redundancies that exist within a given video frame through transform, quantization, and entropy (or variable-length) coding. Inter-coded macroblocks are coded to exploit the temporal redundancies that exist between macroblocks in successive frames, so that only changes between successive frames need to be coded. This is accomplished through motion estimation and compensation.
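The motion estimation step mentioned above can be illustrated with a minimal full-search block-matching sketch, which finds the displacement into a reference frame that best matches the current block. All names here are illustrative; real H.264 encoders use much faster search strategies, multiple reference frames, variable block sizes, and sub-pixel refinement.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search_motion_estimation(cur_block, ref_frame, block_y, block_x,
                                  search_range=4):
    """Return the (dy, dx) motion vector within +/- search_range pixels
    that minimizes the SAD against the reference frame, plus its cost."""
    bh, bw = len(cur_block), len(cur_block[0])
    h, w = len(ref_frame), len(ref_frame[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = block_y + dy, block_x + dx
            if y < 0 or x < 0 or y + bh > h or x + bw > w:
                continue  # candidate falls outside the reference frame
            candidate = [row[x:x + bw] for row in ref_frame[y:y + bh]]
            cost = sad(cur_block, candidate)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

In motion compensation, the block at the selected displacement serves as the prediction, and only the motion vector and the residual need to be coded.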
In order to increase the efficiency of the intra coding process for the intra-coded macroblocks, spatial correlation between adjacent macroblocks in a given frame is exploited by using intra prediction 105. Since adjacent macroblocks in a given frame tend to have similar visual properties, a given macroblock in a frame may be predicted from already coded, surrounding macroblocks. The difference or residual between the given macroblock and its prediction is then coded, thereby resulting in fewer bits to represent the given macroblock as compared to coding it directly. A block diagram illustrating intra prediction in more detail is shown in FIG. 2.
Intra prediction may be performed for an entire 16×16 macroblock or it may be performed for each 4×4 block within a 16×16 macroblock. These two different prediction types are denoted by “Intra_16×16” and “Intra_4×4”, respectively. The Intra_16×16 mode is more suited for coding very smooth areas of a video frame, while the Intra_4×4 mode is more suited for coding areas of a video frame having significant detail.
In the Intra_4×4 mode, each 4×4 block is predicted from spatially neighboring samples, as illustrated in FIGS. 3A-3B. The sixteen samples of the 4×4 block 300, labeled “a-p,” are predicted using previously decoded, i.e., reconstructed, samples in adjacent blocks, labeled “A-Q.” That is, block X 305 is predicted from reconstructed pixels of neighboring blocks A 310, B 315, C 320, and D 325. Specifically, intra prediction uses data from the blocks above and to the left of the block being predicted: for example, the lower-right pixel of the block above and to the left, the bottom row of pixels of the block directly above, the bottom row of pixels of the block above and to the right, and the rightmost column of pixels of the block to the left.
For each 4×4 block in a macroblock, one of nine intra prediction modes defined by the H.264 video coding standard may be used. The nine intra prediction modes are illustrated in FIG. 4. In addition to a “DC” prediction mode (Mode 2), eight directional prediction modes are specified. Those modes are suitable to predict directional structures in a video frame such as edges at various angles.
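The formation of the prediction block for three of the nine modes (vertical, horizontal, and DC) can be sketched as follows. The six directional modes and the standard's neighbor-availability rules are omitted for brevity, and the function name is illustrative; the DC rounding shown ((sum + 4) >> 3 when both neighbors are available) follows the standard.

```python
def intra4x4_predict(above, left, mode):
    """Form a 4x4 prediction block from reconstructed neighbor samples.

    `above` holds the four samples A-D from the block above; `left`
    holds the four samples I-L from the block to the left.  Only modes
    0 (vertical), 1 (horizontal), and 2 (DC) are sketched here.
    """
    if mode == 0:  # vertical: each row copies the samples from above
        return [list(above) for _ in range(4)]
    if mode == 1:  # horizontal: each column copies the samples from the left
        return [[p] * 4 for p in left]
    if mode == 2:  # DC: every sample is the rounded mean of the neighbors
        dc = (sum(above) + sum(left) + 4) // 8
        return [[dc] * 4 for _ in range(4)]
    raise NotImplementedError("directional modes 3-8 are not shown here")
```

For instance, with above samples (1, 2, 3, 4) and left samples (5, 6, 7, 8), the vertical mode repeats the row (1, 2, 3, 4) four times, while the DC mode fills the block with the rounded mean of all eight neighbors.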
Typical H.264 video coders select one of the nine possible Intra_4×4 prediction modes according to some criterion to code each 4×4 block within an intra-coded macroblock, in a process commonly referred to as intra coding “mode decision” or “mode selection.” Once the intra prediction mode is selected, the prediction pixels are taken from the reconstructed version of the neighboring blocks to form the prediction block. The residual is then obtained by subtracting the prediction block from the current block, as illustrated in FIG. 2.
The mode decision criterion usually involves minimizing a cost to code the residual, as illustrated in FIG. 5 with pseudo code implemented in the JM reference encoder, publicly available at http://iphome.hhi.de/suehring/tml/. The residual is the difference in pixel values between the current block and the predicted block formed from the reconstructed pixels in the neighboring blocks. The cost evaluated can be a Sum of Absolute Differences (“SAD”) between the original block and the predicted block, a Sum of Square Differences (“SSE”) between the original block and the predicted block, or, most commonly, a rate-distortion cost. The rate-distortion criterion evaluates the Lagrange cost of predicting the block with each of the nine candidate modes and selects the mode that yields the minimum Lagrange cost.
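A simplified, self-contained version of this mode decision loop may be sketched as follows. It is only a sketch: it evaluates three of the nine modes (vertical, horizontal, and DC), uses a plain SAD cost rather than the Lagrangian rate-distortion cost of the JM reference encoder, and all names are illustrative.

```python
def sad_mode_decision(cur_block, above, left):
    """Pick the mode with minimum SAD cost and return (mode, cost, residual).

    `cur_block` is the 4x4 block being coded; `above` and `left` are the
    reconstructed neighbor samples used to form each candidate prediction.
    """
    candidates = {
        0: [list(above) for _ in range(4)],           # vertical
        1: [[p] * 4 for p in left],                   # horizontal
        2: [[(sum(above) + sum(left) + 4) // 8] * 4   # DC
            for _ in range(4)],
    }

    def sad(pred):
        return sum(abs(c - p)
                   for row_c, row_p in zip(cur_block, pred)
                   for c, p in zip(row_c, row_p))

    costs = {mode: sad(pred) for mode, pred in candidates.items()}
    best_mode = min(costs, key=costs.get)
    # The residual for the winning mode is what would then be transformed,
    # quantized, and entropy coded.
    residual = [[c - p for c, p in zip(row_c, row_p)]
                for row_c, row_p in zip(cur_block, candidates[best_mode])]
    return best_mode, costs[best_mode], residual
```

Replacing the SAD with a Lagrange cost (distortion plus a rate term weighted by a Lagrange multiplier) yields the rate-distortion criterion described above.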
Because of its high coding efficiency, the H.264 video coding standard is able to compress multimedia content at low bit rates while achieving good visual quality. The H.264 video coding standard is also designed to provide robustness in error-prone environments and content-based scalability. These features allow H.264-encoded video to be accessible over a wide range of media at various qualities and temporal and spatial resolutions. Despite these beneficial functionalities, however, typical H.264 video coders are not well suited for coding a single video sequence for distribution to multiple users on multiple devices. This is because when H.264 video coders encode a video sequence for distribution, they typically do not know the types of devices on which the video sequence will be played. As a result, a video sequence encoded with pre-set coding parameters may not be displayable on some devices.
For example, suppose a video sequence is coded with an H.264 video coder at a given bit rate, visual quality, and resolution. The video sequence may be distributed to a user of a personal computer, a user of a personal digital assistant, and a user of a small mobile device. Depending on the bit rate and resolution of the encoded video sequence, it may be impractical, or even impossible with some currently available devices, for the user of the personal digital assistant and/or the user of the small mobile device to view the video sequence. In particular, the display screens of those devices may be too small for the video sequence to be properly displayed, in addition to other bandwidth and memory constraints.
To address these different display sizes and device capabilities, several techniques have been proposed. The most popular ones involve transcoding and/or encoding a Region-of-Interest (“ROI”) within a video sequence. In general, transcoding techniques convert the bit rate of a coded video sequence to match the bandwidth and other requirements of the display device. In ROI transcoding, a video sequence is divided into two parts: one representing the ROI and the other representing the background. The ROI may be any region or portion of the video sequence of interest to a user, such as, for example, a given object, person, or area within a scene. In most cases, the ROI is defined as a rectangular region surrounding the portion of the video sequence of interest. The user may identify the rectangular region prior to encoding the video sequence or specify it during decoding.
For example, in one technique, users have to interact with a network server to specify the ROI and wait for the transcoded sequence. The ROI is typically sent with high visual quality, and the background is either sent with low visual quality or not sent at all, depending on the network bandwidth. In another example, the ROI is pre-specified during encoding, taking advantage of the Flexible Macroblock Ordering (“FMO”) feature available in the H.264 video coding standard to prioritize particular slice groups. In yet another example, a preprocessor is used to identify a ROI, which is then coded and transmitted using an FMO mapping function.
These and other ROI-based transcoding techniques are limited in that once the ROI is determined, its size and position cannot be modified during the decoding process. That is, an arbitrary-sized ROI cannot be extracted at different access points of the video sequence. For example, consider a single video sequence of a customer shopping at a store. The store security personnel may desire to select a ROI around an aisle in the store for proper identification of a customer suspected of shoplifting at that aisle. The store security personnel may also desire to select a ROI around the cashier region of the store to get a better view of the suspect's face. With currently available H.264 video coders, the store security personnel cannot decode the single video sequence to access lower-resolution yet ROI-focused portions of the video sequence, i.e., the portions corresponding to the particular aisle and cashier region of the store.
Accordingly, it would be desirable to provide video coding techniques for supporting extraction of arbitrary-sized ROIs at different access points during decoding of a video sequence. In particular, it would be desirable to provide a video coding technique such that a video sequence can be encoded once and used by multiple devices with different display screen sizes and video decoding/playing capabilities.