Video compression can be considered the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video encoding can achieve compression by exploiting redundancies in the video data, whether spatial, temporal, or color-space. Video compression processes typically segment the video data into portions, such as groups of frames and groups of pels, to identify areas of redundancy within the video that can be represented with fewer bits than required by the original video data. When these redundancies in the data are exploited, greater compression can be achieved. An encoder can be used to transform the video data into an encoded format, while a decoder can be used to transform encoded video back into a form comparable to the original video data. The implementation of the encoder/decoder is referred to as a codec.
Most modern standardized video encoders (referred to herein as “standard encoders”) divide a given video frame into non-overlapping coding units or macroblocks (rectangular regions of contiguous pels, herein referred to more generally as “input blocks” or “data blocks”) for encoding. Compression can be achieved when data blocks are predicted and encoded using previously-coded data. The process of encoding data blocks using spatially neighboring samples of previously-coded blocks within the same frame is referred to as intra-prediction. Intra-prediction attempts to exploit spatial redundancies in the data. The encoding of data blocks using similar regions from previously-coded frames, found using a motion estimation process, is referred to as inter-prediction. Inter-prediction attempts to exploit temporal redundancies in the data. The motion estimation process can generate a motion vector that specifies, for example, the location of a matching region in a reference frame relative to a data block that is being encoded.
The encoder may measure the difference between the data to be encoded and the prediction to generate a residual. The residual can provide the difference between a predicted block and the original data block. The predictions, motion vectors (for inter-prediction), residuals, and related data can be combined with other processes such as a spatial transform, a quantizer, an entropy encoder, and a loop filter to create an efficient encoding of the video data. The residual that has been transformed and quantized can be processed and added back to the prediction, assembled into a decoded frame, and stored in a framestore. Details of such encoding techniques for video will be familiar to a person skilled in the art.
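The residual and reconstruction steps described above can be illustrated with a minimal sketch. It assumes 8-bit samples and a plain uniform quantizer, and it omits the spatial transform, entropy coding, and loop filter entirely; the function names and the step size of 3 are hypothetical choices made for illustration only.

```python
def subtract(original, prediction):
    """Residual: element-wise difference between the original block and its prediction."""
    return [[o - p for o, p in zip(o_row, p_row)]
            for o_row, p_row in zip(original, prediction)]


def quantize(residual, step):
    """Uniform quantization of each residual sample to an integer level (no transform)."""
    return [[round(r / step) for r in row] for row in residual]


def reconstruct(prediction, levels, step):
    """Decoder-side reconstruction: rescale levels, add the prediction, clip to 8 bits."""
    return [[max(0, min(255, p + lv * step)) for p, lv in zip(p_row, l_row)]
            for p_row, l_row in zip(prediction, levels)]


original = [[52, 55], [61, 59]]
prediction = [[50, 50], [60, 60]]

residual = subtract(original, prediction)          # [[2, 5], [1, -1]]
levels = quantize(residual, step=3)                # [[1, 2], [0, 0]]
decoded = reconstruct(prediction, levels, step=3)  # [[53, 56], [60, 60]]
```

The decoded block differs slightly from the original because quantization discards information; it is this decoded block, not the original, that would be stored for later prediction.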
H.264 (MPEG-4 Part 10, Advanced Video Coding [AVC]) and H.265 (MPEG-H Part 2, High Efficiency Video Coding [HEVC]), hereafter referred to as H.264 and H.265, respectively, are two codec standards for video compression that achieve high quality video representation at relatively low bitrates. The basic coding unit for H.264 is the 16×16 macroblock, while the equivalent coding tree units for H.265 can range in size from 16×16 up to 64×64.
Standard encoders typically define three types of frames (or pictures), based on how the data blocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in the frame itself and thus consists of only intra-predicted blocks. A P-frame (predicted picture) is encoded via forward prediction, using data from previously-decoded I-frames or P-frames, also known as reference frames. P-frames can contain either intra blocks or (forward-)predicted blocks. A B-frame (bi-predicted picture) is encoded via bi-directional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, or bi-predicted blocks.
A particular set of frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information as to how the data blocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards such as MPEG-2 use one reference frame (in the past) to predict P-frames and two reference frames (one past, one future) to predict B-frames. By contrast, more recent compression standards such as H.264 and H.265 allow the use of multiple reference frames for P-frame and B-frame prediction.
In standard encoders, inter-prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target block (the current data block being encoded) and same-sized regions within previously-decoded reference frames. When such a match is found, the encoder may transmit a motion vector, which serves as a pointer to the best match's position in the reference frame. For computational reasons, the BBMEC search process is limited, both temporally in terms of reference frames searched and spatially in terms of neighboring regions searched.
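A spatially limited block-matching search of this kind can be sketched as follows. This is an illustrative exhaustive search over a square window in a single reference frame, using the sum of absolute differences (SAD) as the match error; the function names, the (dx, dy) motion vector convention, and the choice of SAD are assumptions, not details taken from any particular encoder.

```python
def sad(ref, top, left, block):
    """Sum of absolute differences between `block` and the same-sized region of `ref`."""
    return sum(abs(ref[top + y][left + x] - block[y][x])
               for y in range(len(block))
               for x in range(len(block[0])))


def full_search(target, ref, top, left, search_range):
    """Return ((dx, dy), cost) of the best match within +/- search_range pels.

    (top, left) is the position of the target block in the current frame;
    candidate regions that fall outside the reference frame are skipped.
    """
    h, w = len(target), len(target[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > len(ref) or x + w > len(ref[0]):
                continue
            cost = sad(ref, y, x, target)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Reference frame: flat background with one distinctive 2x2 patch at row 3, column 4.
ref = [[0] * 8 for _ in range(8)]
ref[3][4], ref[3][5], ref[4][4], ref[4][5] = 10, 20, 30, 40

# Target block: the same patch, located at row 2, column 2 of the current frame.
target = [[10, 20], [30, 40]]
mv, cost = full_search(target, ref, top=2, left=2, search_range=2)
```

Here the search returns the motion vector (2, 1) with zero error, since the patch sits two pels to the right of and one pel below the target position. The cost of the search grows with the square of `search_range`, which is one reason practical encoders restrict the window.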
The simplest form of the BBMEC process initializes the motion estimation using a (0, 0) motion vector, meaning that the initial estimate of a target block is the co-located block in the reference frame. More recent motion estimation algorithms such as enhanced predictive zonal search (EPZS) [Tourapis, A.; “Enhanced predictive zonal search for single and multiple frame motion estimation,” Proc. SPIE 4671, Visual Communications and Image Processing, pp. 1069-1078, 2002] consider a set of motion vector candidates for the initial estimate of a target block, based on the motion vectors of neighboring blocks that have already been encoded, as well as the motion vectors of the co-located block (and neighbors) in the previous reference frame. Once the set of initial motion vector candidates has been gathered, fine motion estimation is then performed by searching in a local neighborhood of the initial motion vectors for the region that best matches (i.e., has lowest error in relation to) the target block. The local search may be performed by exhaustive query of the local neighborhood or by any one of several “fast search” methods, such as a diamond or hexagonal search.
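The two-stage idea (pick the best of several initial candidates, then refine locally) can be sketched as below. This is not the EPZS algorithm itself: it is a simplified illustration that assumes SAD as the error measure and uses a small four-point diamond for the local refinement. All function names are hypothetical.

```python
def sad(ref, top, left, block):
    """SAD between `block` and the region of `ref` at (top, left); inf if out of bounds."""
    h, w = len(block), len(block[0])
    if top < 0 or left < 0 or top + h > len(ref) or left + w > len(ref[0]):
        return float("inf")
    return sum(abs(ref[top + y][left + x] - block[y][x])
               for y in range(h) for x in range(w))


def predictive_diamond_search(target, ref, top, left, candidates):
    """Start from the best candidate (dx, dy), then step to the best of four
    diamond neighbors until no neighbor lowers the error."""
    def cost(mv):
        return sad(ref, top + mv[1], left + mv[0], target)

    best = min(candidates, key=cost)
    best_cost = cost(best)
    while True:
        neighbors = [(best[0] + ox, best[1] + oy)
                     for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        step = min(neighbors, key=cost)
        step_cost = cost(step)
        if step_cost >= best_cost:  # local minimum reached
            return best, best_cost
        best, best_cost = step, step_cost


# Smooth synthetic reference frame, so the error surface has a single minimum.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
# Target block at row 3, column 3 of the current frame; its true match is at row 4, column 5.
target = [[45, 46], [55, 56]]

# Candidates: the (0, 0) vector plus a hypothetical neighboring block's vector (1, 1).
mv, cost = predictive_diamond_search(target, ref, top=3, left=3,
                                     candidates=[(0, 0), (1, 1)])
```

With a good candidate the refinement reaches the true vector (2, 1) in one step; starting from (0, 0) alone it reaches the same vector in three. On real video the error surface is not smooth, so a fast search like this can stop at a local minimum that an exhaustive search would avoid.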
For any given target block, the encoder may generate multiple inter-predictions to choose from. The predictions may result from multiple prediction processes. The predictions may also differ based on the subpartitioning of the target block, where different motion vectors are associated with different subpartitions of the target block and the respective motion vectors each point to a subpartition-sized region in a reference frame. The predictions may also differ based on the reference frames to which the motion vectors point. Selection of the best prediction for a given target block is usually accomplished through rate-distortion optimization, where the best prediction is the one that minimizes the rate-distortion metric D+λR, where the distortion D measures the error between the target block and the prediction, while the rate R quantifies the cost (in bits) to encode the prediction and λ is a scalar weighting factor.
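The selection rule D+λR can be made concrete with a short sketch. The candidate labels, distortion values, and bit costs below are invented for illustration; only the cost expression itself comes from the description above.

```python
def rd_cost(distortion, rate_bits, lam):
    """Rate-distortion cost D + lambda * R."""
    return distortion + lam * rate_bits


def select_prediction(candidates, lam):
    """Return the candidate with the lowest rate-distortion cost."""
    return min(candidates,
               key=lambda c: rd_cost(c["distortion"], c["rate_bits"], lam))


# Hypothetical candidates for one target block.
candidates = [
    {"name": "16x16, one motion vector", "distortion": 400, "rate_bits": 20},
    {"name": "8x8, four motion vectors", "distortion": 250, "rate_bits": 60},
    {"name": "intra",                    "distortion": 600, "rate_bits": 10},
]

low_lambda_choice = select_prediction(candidates, lam=1)["name"]
mid_lambda_choice = select_prediction(candidates, lam=5)["name"]
high_lambda_choice = select_prediction(candidates, lam=50)["name"]
```

With these numbers, λ = 1 favors the four-vector partition (cost 310), λ = 5 favors the single-vector prediction (cost 500), and λ = 50 favors the cheapest-to-signal candidate (cost 1100). A larger λ therefore shifts the decision toward predictions that cost fewer bits, even at higher distortion.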
Standard encoders modulate the amount of compression that occurs within a GOP, an individual frame, a row of data blocks within a frame, or an individual data block, by means of a quantization parameter (QP). If the QP value is high, more quantization occurs and fewer bits are used to represent the data, but the visual quality of the encoded output is worse. If the QP value is low, less quantization occurs and more bits are used, but the visual quality of the encoded output is better. This tradeoff between bitrate (number of bits in the output bitstream per second of the input video) and quality is well known to persons skilled in the art.
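The effect of QP on quantization can be sketched numerically. The sketch assumes the commonly cited H.264 approximation in which the quantizer step size doubles for every increase of 6 in QP and equals roughly 1 at QP 4; the coefficient values are invented, the rounding rule is a simplification, and the count of nonzero levels is used only as a rough stand-in for bit cost.

```python
import math


def qstep(qp):
    """Approximate H.264-style step size: doubles every 6 QP, about 1.0 at QP 4."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeffs, qp):
    """Uniform quantization, rounding magnitudes half up (a simplification)."""
    step = qstep(qp)
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c)) for c in coeffs]


def dequantize(levels, qp):
    """Rescale integer levels back to coefficient values."""
    step = qstep(qp)
    return [lv * step for lv in levels]


def abs_error(coeffs, qp):
    """Total absolute reconstruction error after quantizing at the given QP."""
    recon = dequantize(quantize(coeffs, qp), qp)
    return sum(abs(c - r) for c, r in zip(coeffs, recon))


coeffs = [100, -37, 12, 5, -3, 1]  # hypothetical transform coefficients

low_qp_levels = quantize(coeffs, 10)   # step 2:  [50, -19, 6, 3, -2, 1]
high_qp_levels = quantize(coeffs, 28)  # step 16: [6, -2, 1, 0, 0, 0]
```

At QP 28 half of the levels are zero and the remaining ones are small, so fewer bits are needed, but the total absolute error rises from 4 to 22 in this example.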
The rate control algorithm of the encoder sets the QP values for a frame (as the frame QP), a row of data blocks within a frame (as the row QP), or an individual data block (as the block QP). The rate control algorithm allocates a bit budget to each GOP, frame, and row to achieve a target bitrate for the video encoding. Based on how many bits have been used in the encoding relative to the target bitrate and how full a virtual decoder buffer is in a hypothetical reference decoder (HRD), the rate control algorithm may increase or decrease the QP value for a given data block, row, or frame. The type of rate control determines how much the bitrate may vary from frame to frame. Constant bitrate (CBR) rate control allows little or no variation in the target bitrate from frame to frame. Variable bitrate (VBR) rate control still attempts to achieve the target bitrate on average across the entire video but allows the local bitrate for individual frames to exceed the target bitrate by some factor (e.g., 1.5 or 2 times the target bitrate). Constant rate factor (CRF) rate control attempts to maintain the quality of the output bitstream from frame to frame with less concern for the bitrate of the bitstream. However, CRF rate control may be applied with a “max-rate” parameter that governs the maximum bitrate for any given frame, thus achieving rate control similar to VBR for complex videos.
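The feedback between bits spent and QP can be illustrated with a deliberately simple sketch. It is not the rate control of any actual encoder: the HRD buffer model is omitted, the per-frame budget is a flat split of the target bitrate, and the tolerance and one-step QP adjustment are hypothetical choices. The QP range of 0 to 51 assumed here is the one used by 8-bit H.264.

```python
def frame_budget(target_bitrate_bps, frames_per_second):
    """Flat per-frame bit budget implied by a target bitrate."""
    return target_bitrate_bps / frames_per_second


def update_qp(qp, bits_used, bits_budget, tolerance=0.1, qp_min=0, qp_max=51):
    """Raise QP when over budget, lower it when under budget, otherwise hold."""
    ratio = bits_used / bits_budget
    if ratio > 1 + tolerance:
        qp += 1   # overspending: quantize more coarsely
    elif ratio < 1 - tolerance:
        qp -= 1   # underspending: quantize more finely
    return max(qp_min, min(qp_max, qp))


budget = frame_budget(3_000_000, 30)  # 100000.0 bits per frame

qp = 30
for bits_used in (130_000, 125_000, 104_000, 80_000):
    qp = update_qp(qp, bits_used, budget)
```

Over the four hypothetical frames the QP moves 30 → 31 → 32 → 32 → 31. A CBR-style controller would use a tight tolerance, while a VBR-style controller would tolerate larger local overshoot before reacting.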
Often, the input parameters of an encoder are specified by some default configurations that generally vary according to capability, complexity, and encoding speed. For example, the open-source x264 encoder for H.264 encoding has ten predefined presets, ranging from “ultrafast” (fastest speed, lowest capability) to “placebo” (slowest speed, highest capability), that set the encoding parameters. Encoding parameters that can be modified include the GOP length, the number of reference frames for inter-prediction, the maximum number of consecutive B-frames, the usage of B-frames as reference frames, the placement of adaptive B-frames, the motion estimation algorithm, the maximum range for motion estimation, the subpixel motion estimation algorithm (for fine motion estimation), and the allowable partitions for subpartitioning. In addition to the encoding parameters, the target bitrate, which can be thought of as another input parameter, is also specified in many applications as a function of frame size, available network bandwidth, and other considerations.
Using default configurations to set the input parameters for encoding can lead to encoding inefficiencies when the input parameters are not well-matched to the characteristics of the video data. Consider a method where target bitrate is specified based on the frame resolution, for example. In this case, the same target bitrate is applied independently of the content in the video. If the video content has low spatial complexity and low motion, the target bitrate will likely be “too high” and bits will be wasted because acceptable quality could be achieved with fewer bits; if the video content is spatially complex with high motion, the target bitrate will likely be “too low” and the encoding quality will be poor. The solution to this type of “settings mismatch” is to characterize the video data and then perform “smart” encoding with content-adaptive input parameters.
In general, the process of characterizing video data to derive data-adaptive input parameters involves a few essential steps. First, the data needs to be characterized by computing one or more metrics. Second, the metric values need to be converted to decisions about the input parameters. Third, to determine the effectiveness of the process, the modified encodings with data-adaptive input parameters should be compared against “original” encodings with default input parameters.
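The first two steps can be sketched in a few lines. Everything in this sketch is a hypothetical illustration rather than a method described in the text: the two metrics (mean absolute difference between horizontal neighbors, and mean absolute difference between co-located pels in consecutive frames), the thresholds, and the bitrate scale factors are all assumptions chosen to keep the example small. The third step, comparison against a default encoding, requires an actual encoder and is not shown.

```python
def spatial_complexity(frame):
    """Step 1 (metric): mean absolute difference between horizontally adjacent pels."""
    diffs = [abs(row[x + 1] - row[x]) for row in frame for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)


def temporal_complexity(previous, current):
    """Step 1 (metric): mean absolute difference between co-located pels."""
    diffs = [abs(c - p) for p_row, c_row in zip(previous, current)
             for p, c in zip(p_row, c_row)]
    return sum(diffs) / len(diffs)


def adapt_bitrate(default_bitrate, spatial, temporal, low=2.0, high=10.0):
    """Step 2 (decision): scale the default target bitrate by content complexity."""
    score = (spatial + temporal) / 2
    if score < low:
        return default_bitrate * 0.75   # simple content: spend fewer bits
    if score > high:
        return default_bitrate * 1.25   # complex content: spend more bits
    return float(default_bitrate)


flat_a = [[128, 128], [128, 128]]
flat_b = [[128, 128], [128, 128]]
busy_a = [[0, 50], [50, 0]]
busy_b = [[50, 0], [0, 50]]

flat_rate = adapt_bitrate(1000, spatial_complexity(flat_b),
                          temporal_complexity(flat_a, flat_b))
busy_rate = adapt_bitrate(1000, spatial_complexity(busy_b),
                          temporal_complexity(busy_a, busy_b))
```

For the static, flat frames the target drops to 750, and for the high-contrast, changing frames it rises to 1250, mirroring the “too high” and “too low” cases of a fixed, resolution-based target.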
Different methods and systems have been proposed to adapt input parameters for video encoding based on the video characteristics. In [Carmel, S. et al., “Quality driven video re-encoding,” U.S. patent application Ser. No. 14/912,291, filed Aug. 11, 2014], a metric called block-based coding quality is computed for a given video to be encoded (referred to herein as a source video) and then a decision is made as to how much the target bitrate can be lowered while maintaining an acceptable value of the quality metric. In this case, there is a single metric to compute and a single input parameter to be modified (the target bitrate), and the video to be encoded must be re-encoded in a closed-loop system to obtain the “improved” (lower-bandwidth) encoding. In [Koren, N. et al., “Encoding/transcoding based on subjective video quality preferences,” U.S. patent application Ser. No. 15/049,051, filed Feb. 20, 2016], video quality (VQ) is measured by an “objective VQ compare module” to determine how closely an encoded video matches a user's “VQ profile” (representing the user's aesthetic video preferences), with the results fed back to allow re-encoding of the video at a lower bandwidth or higher quality. Koren et al. give no details as to what constitutes the metrics in the “objective VQ compare module” or what input parameters are modified in the re-encoding. It is clear, however, that both of the methods described above are closed-loop systems that require multiple encodings of the same source video to obtain the final encoding with improved settings.