In video processing, given an original uncompressed video, the purpose of a video encoder is to produce a compressed representation that is smaller in size but can be decompressed to produce a video closely resembling the original. When designing an encoder, there is a trade-off between encoder complexity and compression efficiency: the more time an encoder has at its disposal, the more complex the compression methods it can use, and the better its output will usually be for a given bitrate. In some applications, such as videoconferencing, the encoder must work in real time, which makes it challenging to achieve good compression.
High Efficiency Video Coding (HEVC), also referred to as H.265, is a video coding standard being developed in the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC is a collaborative project between the Moving Picture Experts Group (MPEG) and the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T). Currently, an HEVC Model (HM) is defined that includes a number of tools and is considerably more efficient than the existing video coding standard H.264/Advanced Video Coding (AVC).
HEVC is a block-based hybrid video codec that uses both inter prediction (prediction from previously coded pictures) and intra prediction (prediction from previously coded pixels in the same picture). Each picture is divided into square treeblocks (corresponding to macroblocks in H.264/AVC) that can be of size 16×16, 32×32 or 64×64 pixels. A variable CtbSize is used to denote the size of the treeblocks, expressed as the number of pixels in one dimension, i.e. 16, 32 or 64.
Hence, when encoding a frame of video with H.265, the frame is split into treeblocks, and each treeblock is then hierarchically split into Coding Units (CUs), ranging in size from 64×64 down to 8×8 pixels.
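The hierarchical split of a treeblock into CUs can be sketched as a recursive quadtree partition. The following is a minimal illustration only, not the procedure of any particular encoder: the function name `split_treeblock` and the callback `should_split`, which stands in for whatever decision rule an encoder applies, are hypothetical.

```python
def split_treeblock(x, y, size, should_split, min_cu_size=8):
    """Return a list of (x, y, size) CUs that together cover one treeblock.

    should_split is a hypothetical callback (x, y, size) -> bool that stands
    in for the encoder's own split decision; it is not defined by the text.
    """
    if size > min_cu_size and should_split(x, y, size):
        half = size // 2
        cus = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(
                    split_treeblock(x + dx, y + dy, half, should_split, min_cu_size)
                )
        return cus
    # Either the minimum CU size was reached or the callback declined to split.
    return [(x, y, size)]


# Example: split a 64x64 treeblock once, giving four 32x32 CUs.
four_cus = split_treeblock(0, 0, 64, lambda x, y, size: size == 64)
```

Whatever the split decisions are, the resulting CUs tile the treeblock exactly, so their areas always sum to the treeblock area.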
Compressing a CU is done in two steps. First, the pixel values in the CU are predicted from previously coded pixel values, either in the same frame or in previous frames. Second, the difference between the predicted pixel values and the actual values is calculated and transformed.
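The difference calculation in the second step can be illustrated with a small sketch. The function names `residual` and `reconstruct` and the sample pixel values are assumptions made for illustration, and the transform applied to the difference is omitted entirely.

```python
def residual(actual, predicted):
    """Per-pixel prediction error: actual value minus predicted value."""
    return [
        [a - p for a, p in zip(actual_row, predicted_row)]
        for actual_row, predicted_row in zip(actual, predicted)
    ]


def reconstruct(predicted, error):
    """Adding the prediction error back to the prediction recovers the block."""
    return [
        [p + e for p, e in zip(predicted_row, error_row)]
        for predicted_row, error_row in zip(predicted, error)
    ]


# Illustrative 2x2 block; the pixel values are made up.
actual = [[52, 55], [61, 59]]
predicted = [[50, 50], [60, 60]]
error = residual(actual, predicted)  # [[2, 5], [1, -1]]
```

A good prediction leaves small error values, which is what makes the subsequent transform and coding of the difference cheaper than coding the pixel values directly.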
Furthermore, prediction can be performed for an entire CU at once, or on smaller parts separately. This is done by defining Prediction Units (PUs), which may be the same size as the CU for a given set of pixels, or further split hierarchically into smaller PUs. Each PU defines separately how it will predict its pixel values from previously coded pixel values.
In a similar fashion, the transforming of the prediction error is done in Transform Units (TUs), which may be the same size as CUs or split hierarchically into smaller sizes. The prediction error is transformed separately for each TU.
An H.265 encoder can be implemented by restricting the size of the CUs, PUs and TUs so that they are all either 16×16 or 8×8 pixels. This gives three options for each 16×16 block of pixels:
1. 16×16 CU and PU with 16×16 TU,
2. 16×16 CU and PU with four 8×8 TUs or
3. four 8×8 CUs each with a single 8×8 PU and TU.
Note that in all the above cases the CU is the same size as the PU. However, another combination of 16×16 and 8×8 blocks is possible for which this is not true: this is the case where the CU is one 16×16 block, the PUs are four 8×8 blocks and where the TU is a single 16×16 block. By disallowing this combination in the encoder, it is possible to use the simplifying assumption that the size of the PU is always the same as the size of the CU. The division of the treeblocks 100 into CUs 110 and PUs 120 and TUs 130 in the CUs is illustrated in FIG. 1.
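The three options, and the effect of the simplifying assumption, can be checked by enumeration. This is a sketch under stated assumptions: the tuple layout (CU size, PU size, TU size) and the name `allowed_configs` are illustrative, and the rule that a TU cannot be larger than its CU is encoded explicitly.

```python
from itertools import product


def allowed_configs(sizes=(16, 8)):
    """Enumerate (cu, pu, tu) edge lengths for one 16x16 block of pixels."""
    configs = []
    for cu, pu, tu in product(sizes, repeat=3):
        if pu != cu:
            # Simplifying assumption: the PU is always the same size as the CU.
            continue
        if tu > cu:
            # A TU is either the size of its CU or a smaller split of it.
            continue
        configs.append((cu, pu, tu))
    return configs


configs = allowed_configs()
# configs == [(16, 16, 16), (16, 16, 8), (8, 8, 8)]
```

The disallowed combination, a 16×16 CU with 8×8 PUs and a 16×16 TU, corresponds to the tuple (16, 8, 16) and is filtered out by the first check.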
The most straightforward way of determining the size of CUs, TUs and PUs is to try different sizes, measure the number of bits used and the error for each size, and choose the one that is best according to some metric. This is what, for example, the reference software for H.265 does.
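Such an exhaustive search can be sketched as follows. The text leaves the metric open, so the weighted sum of error and bits used here (error + lam × bits) is an assumption chosen for illustration, as are the function name `choose_best`, the weight `lam` and the sample measurements.

```python
def choose_best(candidates, lam=10.0):
    """Pick the candidate with the lowest combined cost.

    candidates maps a label to a (bits, error) pair measured by actually
    encoding with that size. The cost error + lam * bits is an assumed
    metric; the text only requires "some metric".
    """
    def cost(item):
        bits, error = item[1]
        return error + lam * bits

    best_label, _ = min(candidates.items(), key=cost)
    return best_label


# Made-up measurements for two transform sizes of one 16x16 block.
measurements = {
    "16x16 TU": (120, 900.0),
    "four 8x8 TUs": (150, 500.0),
}
best = choose_best(measurements, lam=10.0)
```

Note that every entry in `measurements` requires a full trial encoding of the block, which is where the cost of this approach comes from.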
A problem with the approach exemplified above is that evaluating different block sizes is costly. Evaluating just one transform unit size takes around 15% of the total encoding time. This means that evaluating two transform sizes instead of one would increase encoding time by around 15%, which is certainly a problem when fast encoding is a key requirement. Even worse, evaluating one prediction unit size takes around 30% of the total encoding time.
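The effect of these shares on encoding time can be estimated with simple arithmetic. The sketch below assumes, as a simplification not stated in the text, that each additional size evaluated adds its share independently and additively to a baseline in which one transform unit size and one prediction unit size are evaluated; the function name is hypothetical.

```python
def relative_encoding_time(tu_sizes=1, pu_sizes=1, tu_share=0.15, pu_share=0.30):
    """Encoding time relative to evaluating one TU size and one PU size.

    Assumes each extra size costs its share of the baseline time and that
    the costs simply add up (an illustrative simplification).
    """
    if tu_sizes < 1 or pu_sizes < 1:
        raise ValueError("at least one TU size and one PU size must be evaluated")
    return 1.0 + (tu_sizes - 1) * tu_share + (pu_sizes - 1) * pu_share


two_tu = relative_encoding_time(tu_sizes=2)               # about 1.15
two_tu_two_pu = relative_encoding_time(tu_sizes=2, pu_sizes=2)  # about 1.45
```

Under these assumptions, evaluating two sizes of each kind would make encoding roughly 45% slower than evaluating one of each.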