Digital video capabilities can be incorporated into a wide range of apparatuses, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, electronic book readers, digital cameras, digital recording apparatuses, digital media players, video gaming apparatuses, video game consoles, cellular or satellite radio telephones, video conferencing apparatuses, video streaming apparatuses, and the like. Digital video apparatuses implement video compression technologies, such as those described in the standards defined by Moving Picture Experts Group (MPEG)-2, MPEG-4, International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.263, ITU-T H.264/MPEG-4 Part 10 Advanced Video Coding (AVC), and ITU-T H.265 High Efficiency Video Coding (HEVC), and in extensions of such standards. By implementing such video coding technologies, a video apparatus can transmit, receive, encode, decode, and/or store digital video information more efficiently.
In the field of video coding, a frame is an entire picture. Pictures that are arranged frame by frame into a video format according to a particular sequence and frame rate may then be played. When the frame rate reaches a particular rate, the time interval between two frames is less than the resolution limit of human eyes, persistence of vision occurs, and the pictures therefore appear to move on a screen. Compression of a video file is based on compression coding of a single-frame digital picture. A digitized picture contains a large amount of repeated information, which is referred to as redundant information. A frame usually has many parts having a same or similar spatial structure. For example, the colors of sampling points of a same object or background are usually closely associated and similar. In a multi-frame picture group, a frame is generally strongly correlated with the previous frame or the next frame, and the difference between the pixel values that describe the information is very small. These are all parts that can be compressed. That is, the video file includes not only spatially redundant information but also a large amount of temporally redundant information, which results from the composition structure of a video. For example, the frame rate of video sampling is usually 25 frames/second to 30 frames/second, and 60 frames/second may occur in a special case. That is, the sampling time interval between two neighboring frames is typically only 1/30 second to 1/25 second. In such a short time, the pictures obtained by means of sampling contain a large amount of similar information and are closely associated with each other. However, an original digital video recording system records each frame independently and does not consider or use such features as continuity and similarity. Consequently, a large quantity of repeated and redundant data is produced.
In addition, research has indicated that, from the perspective of a psychological feature, that is, the visual sensitivity of human eyes, video information also contains a part that can be compressed, namely visual redundancy. Using visual redundancy means properly compressing a video bitstream by exploiting the physiological property that human eyes are relatively sensitive to a luminance change but relatively insensitive to a chrominance change. In a high-luminance area, the sensitivity of human vision to a luminance change shows a descending trend. Human vision is relatively sensitive to an edge part of an object and relatively insensitive to an inner area, and is relatively sensitive to an overall structure and relatively insensitive to a change of inner details. Video picture information ultimately serves humans. Therefore, compression processing may be performed on original video picture information by fully using these features of human eyes to achieve a more desirable compression effect. In addition to the spatial redundancy, the temporal redundancy, and the visual redundancy mentioned above, other types of redundant information, such as information entropy redundancy, structure redundancy, knowledge redundancy, and importance redundancy, may exist in the video picture information. An objective of video compression encoding is to remove redundant information from a video sequence using various technologies and methods in order to reduce storage space usage and save transmission bandwidth.
In terms of the current state of technical development, video compression processing technology mainly includes intra-frame prediction, inter-frame prediction, transform and quantization, entropy encoding, deblocking filtering processing, and the like. Internationally, existing video compression encoding standards mainly use four types of mainstream compression coding schemes: chroma subsampling, predictive coding, transform coding, and quantization coding.
Chroma subsampling: The scheme fully uses visual and psychological features of human eyes, and attempts to reduce, at the level of the underlying data representation, the data volume needed to describe a single element. Luminance-chrominance-chrominance (YUV) color coding is mostly used in television systems and is a standard widely used in European television systems. A YUV color space includes a luminance signal Y and two chrominance signals U and V. The three components are independent of each other. A representation in which the YUV components are separate from each other is more flexible, occupies less bandwidth for transmission, and is advantageous over a conventional red green blue (RGB) color model. For example, a YUV 4:2:0 form indicates that the quantity of each of the two chrominance components U and V is only a half of the quantity of luminance components Y in both the horizontal and vertical directions, that is, for every four pixel sampling points, there are four luminance components Y and only one chrominance component U and one chrominance component V. In such a representation, the data volume is further reduced and accounts for only approximately 50% of the data volume of full-resolution sampling (six samples per four pixels instead of twelve). Achieving the objective of video compression by means of chroma subsampling, using physiological and visual characteristics of human eyes, is one of the widely used video data compression manners at present.
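The sample counts for the 4:2:0 form can be checked with a short sketch. The following Python is illustrative only: the function names `sample_count` and `subsample_420` are hypothetical, the 1920x1080 frame size is an assumed example, and averaging each 2x2 block is only one possible way of deriving the reduced chrominance plane.

```python
def sample_count(width, height, h_factor, v_factor):
    """Total samples in one frame: a full-resolution Y plane plus
    U and V planes reduced by h_factor and v_factor."""
    luma = width * height
    chroma_plane = (width // h_factor) * (height // v_factor)
    return luma + 2 * chroma_plane


def subsample_420(plane):
    """Reduce a chrominance plane by 2 in each direction by averaging
    every 2x2 block (assumes even width and height)."""
    out = []
    for y in range(0, len(plane), 2):
        row = []
        for x in range(0, len(plane[0]), 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


full = sample_count(1920, 1080, 1, 1)  # Y, U, V all at full resolution
sub = sample_count(1920, 1080, 2, 2)   # 4:2:0 form
ratio = sub / full                     # 0.5
```

Counted against full-resolution Y, U, and V planes, the 4:2:0 form keeps six samples for every four pixels instead of twelve, so `ratio` is 0.5.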
Predictive coding: A current to-be-encoded frame is predicted using data information of a previously encoded frame. A predictor is obtained by means of prediction and is not exactly equal to the actual value, so a residual value exists between the predictor and the actual value. When the prediction is more accurate, the predictor is closer to the actual value and the residual value is smaller. In this way, the data volume may be greatly reduced by encoding the residual value instead of the actual value. During decoding on a decoder side, the original picture is restored or reconstructed by adding the residual value to the predictor. This is the basic concept and method of predictive coding. In mainstream coding standards, predictive coding includes two basic types: intra-frame prediction and inter-frame prediction.
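The relationship between predictor, residual value, and reconstruction can be sketched as follows. This is a minimal illustration that assumes the simplest possible predictor, the co-located sample of the previously encoded frame, with no motion search; the function names and the sample values are hypothetical.

```python
def predict_residual(current, reference):
    """Residual = actual value minus predictor, sample by sample.
    The predictor here is the co-located sample of the reference frame."""
    return [[c - r for c, r in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(current, reference)]


def reconstruct(residual, reference):
    """Decoder side: add the residual back to the same predictor."""
    return [[d + r for d, r in zip(res_row, ref_row)]
            for res_row, ref_row in zip(residual, reference)]


reference = [[100, 100, 102],
             [101, 103, 103]]  # previously encoded frame
current = [[101, 100, 103],
           [101, 104, 102]]    # to-be-encoded frame

residual = predict_residual(current, reference)  # [[1, 0, 1], [0, 1, -1]]
restored = reconstruct(residual, reference)      # equals current
```

The residual values are much smaller in magnitude than the sample values themselves, which is what allows them to be indicated with fewer bits.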
Transform coding: Original spatial-domain information is not directly encoded. Instead, sample values of the information are transformed from the current domain into another, manually defined domain (usually referred to as a transform domain) by means of a transform function, and compression coding is then performed according to the distribution feature of the information in the transform domain. A reason for the transform coding is that the data correlation of video picture data is usually large in the spatial domain, resulting in a large amount of redundant information. Consequently, direct encoding requires a large quantity of bits. The data correlation is greatly reduced in the transform domain such that the redundant information for encoding is reduced, and the data volume needed for the encoding is greatly reduced accordingly. In this way, a relatively high compression ratio may be obtained, and a relatively desirable compression effect may be achieved. Typical transform coding includes the Karhunen-Loeve (K-L) transform, the Fourier transform, and the like. The integer discrete cosine transform (DCT) is a transform coding scheme commonly used in many international standards.
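The reduction of correlation in the transform domain can be illustrated with a small sketch. Note the assumptions: the code below uses a floating-point orthonormal one-dimensional DCT-II rather than the integer DCT of any particular standard, and the eight-sample row of smoothly varying values is a made-up example.

```python
import math


def dct_1d(samples):
    """Orthonormal one-dimensional DCT-II."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        coeffs.append(scale * total)
    return coeffs


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(
                math.pi * (2 * i + 1) * k / (2 * n))
        samples.append(total)
    return samples


row = [100, 102, 104, 106, 108, 110, 112, 114]  # smooth, highly correlated
coeffs = dct_1d(row)
energy = [c * c for c in coeffs]
dc_share = energy[0] / sum(energy)  # fraction of energy in the first coefficient
```

For a smooth row such as this one, nearly all of the signal energy ends up in the first coefficient, while the remaining coefficients are small. The transform itself is invertible, so `idct_1d(coeffs)` returns the original row up to floating-point rounding.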
Quantization coding: The transform coding mentioned above does not itself compress data. The quantization process is the powerful means for data compression and is the main reason for data “loss” in lossy compression. The quantization process is a process of forcibly mapping an input value having a relatively large dynamic range to an output value having a relatively small dynamic range. An input value to quantization has a relatively large range, and therefore requires a relatively large quantity of bits for indication, while an output value obtained after the “forcible mapping” has a relatively small range, and therefore requires only a small quantity of bits for indication. Each quantization input is normalized into a quantization output, that is, quantized into an order of magnitude. Such an order of magnitude is usually referred to as a quantization level (which is usually specified by an encoder).
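The mapping from a large dynamic range to a small one, and the resulting “loss”, can be sketched with a uniform scalar quantizer. This is an assumed, simplified model: the `step` parameter (quantization step size), the function names, and the coefficient values are hypothetical and are not taken from any particular standard.

```python
def quantize(value, step):
    """Map a value to an integer level by rounding value / step
    to the nearest integer (halves round away from zero)."""
    level = int(abs(value) / step + 0.5)
    return level if value >= 0 else -level


def dequantize(level, step):
    """Approximate the original value from its level."""
    return level * step


step = 10
coeffs = [302.6, -18.2, 3.1, -0.9]                # example transform coefficients
levels = [quantize(c, step) for c in coeffs]      # [30, -2, 0, 0]
approx = [dequantize(q, step) for q in levels]    # [300, -20, 0, 0]
errors = [abs(c - a) for c, a in zip(coeffs, approx)]
```

The levels span a much smaller range than the inputs and therefore need fewer bits, but the inputs cannot be recovered exactly: each reconstructed value may differ from its input by up to half of `step`, and small inputs collapse to zero.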
In a coding algorithm based on a hybrid coding architecture, the foregoing compression coding schemes are used in combination. An encoder control module selects, according to local features of different picture blocks in a video frame, the encoding modes used for the picture blocks. Frequency domain prediction or spatial domain prediction is performed on a block on which intra-frame prediction encoding is performed, and motion compensation prediction is performed on a block on which inter-frame prediction encoding is performed. Then, transform and quantization processing is performed on the prediction residual to form a residual coefficient. Finally, a final bitstream is generated using an entropy encoder. To avoid accumulation of prediction errors, the reference signal of intra-frame prediction or inter-frame prediction is obtained using a decoding module on the encoder side. Dequantization and an inverse transform are performed on the residual coefficient obtained after the transform and quantization, to reconstruct a residual signal. The residual signal is then added to the reference signal of prediction to obtain a reconstructed picture. Pixel correction is performed on the reconstructed picture by means of loop filtering in order to improve the encoding quality of the reconstructed picture.
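The reason for obtaining the reference signal from a decoding module on the encoder side can be illustrated with a heavily simplified sketch. The following code is an assumption-laden toy model, not an implementation of any standard: it keeps only prediction from the previous reconstructed frame and quantization of the residual, and it omits the transform, entropy encoding, mode selection, motion compensation, and loop filtering. All names and values are hypothetical.

```python
def quantize(value, step):
    """Round value / step to the nearest integer level."""
    level = int(abs(value) / step + 0.5)
    return level if value >= 0 else -level


def encode_sequence(frames, step):
    """Encode each frame as quantized residuals against the previous
    RECONSTRUCTED frame, so that the encoder and decoder share the same
    reference and quantization errors do not accumulate."""
    reference = [0] * len(frames[0])
    all_levels = []
    for frame in frames:
        residual = [s - p for s, p in zip(frame, reference)]
        levels = [quantize(r, step) for r in residual]
        all_levels.append(levels)
        # Encoder-side decoding: dequantize and add to the predictor.
        reference = [p + q * step for p, q in zip(reference, levels)]
    return all_levels, reference


def decode_sequence(all_levels, step):
    """Rebuild every frame from the quantized residuals alone."""
    reference = [0] * len(all_levels[0])
    decoded = []
    for levels in all_levels:
        reference = [p + q * step for p, q in zip(reference, levels)]
        decoded.append(reference)
    return decoded


frames = [[100, 102, 104], [101, 103, 106], [103, 104, 109]]
step = 4
all_levels, encoder_reference = encode_sequence(frames, step)
decoded = decode_sequence(all_levels, step)
```

In this sketch the last reference held by the encoder is identical to the last frame produced by the decoder, and every decoded sample stays within half of `step` of the original sample. If the encoder predicted from the original frames instead, the decoder would not have the same reference and the errors could accumulate from frame to frame.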