Digital images are subject to a wide variety of distortions during acquisition, processing, compression, storage, transmission and reproduction, any of which may degrade visual quality. For applications in which images are ultimately viewed by human beings, the most reliable method of quantifying visual image quality is subjective evaluation. In practice, however, subjective evaluation is usually too inconvenient, time-consuming and expensive. Objective image quality metrics aim to predict perceived image quality automatically. The simplest and most widely used quality metric is the mean squared error (MSE), computed by averaging the squared intensity differences between distorted and reference image pixels, along with the related peak signal-to-noise ratio (PSNR). However, both have been found to be poorly matched to perceived visual quality. In the past decades, a great deal of effort has gone into the development of advanced quality assessment methods, among which the structural similarity (SSIM) index achieves an excellent trade-off between complexity and quality prediction accuracy, and has become the most broadly recognized image/video quality measure among both academic researchers and industrial implementers.
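The metrics named above can be illustrated with a minimal sketch. It assumes 8-bit grayscale images stored as lists of rows. The function names are illustrative only. The SSIM function here evaluates the standard SSIM expression once over the whole image using population statistics, which is a simplification: practical implementations compute SSIM over local sliding windows and average the results. The two test distortions are constructed to have the same MSE, yet they receive different SSIM scores.

```python
import math

# Stabilising constants of the SSIM formula for 8-bit images (L = 255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def _flatten(image):
    """Turn a list of pixel rows into a flat list of floats."""
    return [float(p) for row in image for p in row]


def mse(reference, distorted):
    """Average squared intensity difference between two images."""
    ref, dis = _flatten(reference), _flatten(distorted)
    if len(ref) != len(dis) or not ref:
        raise ValueError("images must be non-empty and of equal size")
    return sum((r - d) ** 2 for r, d in zip(ref, dis)) / len(ref)


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, distorted)
    return math.inf if error == 0 else 10.0 * math.log10(peak ** 2 / error)


def ssim_single_window(reference, distorted):
    """SSIM evaluated once over the whole image (simplified, no windows)."""
    x, y = _flatten(reference), _flatten(distorted)
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    )


# A small horizontal ramp, a brightness-shifted copy, and a copy with
# checkerboard noise of the same magnitude.
reference = [[50, 100, 150, 200] for _ in range(4)]
shifted = [[p + 10 for p in row] for row in reference]
noisy = [
    [p + (10 if (i + j) % 2 == 0 else -10) for j, p in enumerate(row)]
    for i, row in enumerate(reference)
]

for name, image in (("shifted", shifted), ("noisy", noisy)):
    print(
        name,
        mse(reference, image),
        round(psnr(reference, image), 2),
        round(ssim_single_window(reference, image), 4),
    )
```

Both distorted images have an MSE of 100 and therefore the same PSNR, while the single-window SSIM is higher for the brightness shift than for the checkerboard noise.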
In general, video coding schemes often involve finding the best trade-off between data rate R and the allowed distortion D, the so-called rate-distortion optimization (RDO). An overall rate-distortion cost function may be defined with both R and D terms, and a Lagrange parameter may be used to control the relative weights of the two terms. A central task in RDO is to find the best Lagrange parameter. Existing video coding techniques use the sum of absolute differences (SAD) or the sum of squared differences (SSD) to define the distortion D; both have been widely criticized in the literature because of their poor correlation with perceptual image quality. In order to maximize the perceptual quality of a compressed video stream for a given data rate, or to minimize the data rate without losing perceptual quality, it is desirable to define D using a perceptually more meaningful image quality measure such as SSIM. However, when the distortion function D in RDO is defined using a perceptual quality measure such as the SSIM index, finding the optimal Lagrange parameter can be difficult because of the more complex mathematical structure of the SSIM index.
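The role of the Lagrange parameter can be illustrated with a minimal sketch of the commonly used cost J = D + λR. The candidate coding options, their distortion and rate figures, and all names below are hypothetical values chosen for illustration. They do not come from any particular encoder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingOption:
    """One hypothetical way of coding a block."""
    name: str
    distortion: float  # e.g. SSD; a perceptual variant might use 1 - SSIM
    rate_bits: float   # bits needed to signal this option


def rd_cost(option, lagrange_parameter):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return option.distortion + lagrange_parameter * option.rate_bits


def best_option(options, lagrange_parameter):
    """Pick the option with the lowest rate-distortion cost."""
    return min(options, key=lambda o: rd_cost(o, lagrange_parameter))


# Hypothetical candidates: cheaper options distort more.
options = [
    CodingOption("skip", distortion=400.0, rate_bits=2.0),
    CodingOption("inter_16x16", distortion=150.0, rate_bits=40.0),
    CodingOption("intra_4x4", distortion=60.0, rate_bits=130.0),
]

for lam in (0.5, 2.0, 10.0):
    print(lam, best_option(options, lam).name)
```

A small parameter favors the low-distortion option, and a large one favors the low-rate option. This is why the choice of the parameter determines the operating point on the rate-distortion curve.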
In general, a coded video stream may be composed of multiple groups of pictures (GOPs). Each GOP starts with an I-frame, which is coded independently, and includes all frames up to, but not including, the next I-frame. For example, the general reference encoder for the MPEG-4/H.264 AVC standard encodes pictures with a fixed GOP length, even though the standard allows variable GOP length.
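The GOP definition above can be sketched as follows. The sketch assumes that the stream is described by a string of frame-type letters such as "I", "P" and "B". This representation and the function name are illustrative only.

```python
def split_into_gops(frame_types):
    """Group frame indices into GOPs, each starting at an I-frame.

    A GOP runs from one I-frame up to, but not including, the next I-frame.
    """
    gops = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I":
            gops.append([])  # a new GOP begins at every I-frame
        if not gops:
            raise ValueError("the stream must start with an I-frame")
        gops[-1].append(index)
    return gops


# A stream with variable GOP length: 3, 5 and 2 frames.
print(split_into_gops("IPPIPBPPIP"))
```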
A video sequence usually consists of various scenes, each of which can be categorized in terms of its visual information, content complexity and activity. As a result, the whole video sequence can be divided into GOPs based on properties of the visual content, in such a way that the pictures in each GOP have similar perceptual importance. In order to achieve good perceptual video quality within a given rate budget, it is useful to divide the bits among the GOPs according to the relative perceptual importance of each GOP. This can be achieved by adjusting the quantization level of each GOP based on its perceptual importance. Multi-pass encoding is a video encoding technique in which the first encoding pass analyzes the video and logs information that can then be used in the second and subsequent passes to adjust the bit rate of each GOP so as to maximize perceptual video quality. However, a multi-pass system cannot be employed for real-time applications because it requires more than one pass over the video. In such a case, a single-pass system can be used to perform the bit rate allocation among the GOPs based on already encoded frames.
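One simple way to divide a bit budget among GOPs is sketched below. The proportional rule, the importance scores and the function name are assumptions made for illustration only. The text above does not specify how perceptual importance is measured or how it maps to a quantization level.

```python
def allocate_bits(total_bits, gops):
    """Split a bit budget among GOPs in proportion to frames * importance.

    `gops` is a list of (frame_count, importance) pairs, where importance
    is a hypothetical non-negative perceptual weight per frame.
    """
    weights = [frames * importance for frames, importance in gops]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one GOP needs a positive weight")
    return [total_bits * weight / total_weight for weight in weights]


# Three GOPs: the second is judged three times as important per frame
# as the first, and the third half as important.
budget = allocate_bits(1_000_000, [(10, 1.0), (10, 3.0), (20, 0.5)])
print(budget)
```

In a single-pass setting, the importance values would have to be estimated from frames that have already been encoded rather than from a prior analysis pass.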
A practical approach to developing an objective video quality assessment (VQA) method with good accuracy is to employ an image quality assessment (IQA) method, such as SSIM, that has low computational complexity and achieves high prediction accuracy. The final quality score can be obtained as a weighted average of the quality scores of the individual pictures/frames in a video. Previous studies have shown that assigning larger weights to high-distortion regions generally has a positive effect on the performance of IQA/VQA methods. Since the final score is mainly influenced by the frames with higher distortion, similar perceptual quality can be achieved by using a higher quantization level for GOPs that already have high quality, so that the IQA score is more uniform across all frames in the video sequence. As a result, similar perceptual quality can be achieved at a significantly lower bit rate.
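The weighted averaging described above can be sketched as follows. The weighting rule, in which each frame is weighted by a power of its distortion (one minus its quality score), is an assumption chosen for illustration, as are the function name and the parameter values. The text does not specify a particular weighting.

```python
def pooled_quality(frame_scores, emphasis=2.0, epsilon=1e-6):
    """Weighted average of per-frame quality scores (e.g. SSIM, at most 1).

    Frames with higher distortion (1 - score) receive larger weights.
    `epsilon` keeps the weights positive when every frame is perfect.
    """
    if not frame_scores:
        raise ValueError("at least one frame score is required")
    weights = [(1.0 - score) ** emphasis + epsilon for score in frame_scores]
    weighted_sum = sum(w * s for w, s in zip(weights, frame_scores))
    return weighted_sum / sum(weights)


scores = [0.98, 0.97, 0.70, 0.99]  # one badly distorted frame
plain_mean = sum(scores) / len(scores)
print(round(plain_mean, 4), round(pooled_quality(scores), 4))
```

With these example scores the pooled value lies well below the plain mean, because the single low-quality frame dominates the weights.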