Visual distortion metrics play an important role in monitoring the quality of broadcast image/video, controlling compression efficiency and improving image enhancement processes. There are generally two classes of quality or distortion assessment approaches. The first class is based on mathematically defined measurements, such as the widely used mean square error (MSE) and peak signal to noise ratio (PSNR). The second class measures distortion by simulating the characteristics of the human visual system (HVS).
In the first class approach, the definition of MSE is given by
$$\mathrm{MSE} = \frac{1}{N^{2}} \sum_{i} \sum_{j} \left( c_{i,j} - \hat{c}_{i,j} \right)^{2}$$

wherein $c_{i,j}$ and $\hat{c}_{i,j}$ are pixel values in an original image and a distorted image, respectively. The definition of PSNR is

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}}$$
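The two definitions above can be illustrated with a minimal Python sketch. It assumes square N x N grayscale images held as nested lists and a peak value of 255 (8-bit pixels). The function names `mse` and `psnr` are hypothetical and chosen only for this illustration.

```python
import math


def mse(original, distorted):
    """Mean square error between two N x N images given as nested lists."""
    n = len(original)
    if n == 0 or len(distorted) != n:
        raise ValueError("images must be non-empty and have the same size")
    total = 0.0
    for row_o, row_d in zip(original, distorted):
        if len(row_o) != n or len(row_d) != n:
            raise ValueError("images must be square (N x N)")
        for c, c_hat in zip(row_o, row_d):
            total += (c - c_hat) ** 2  # squared pixel difference
    return total / (n * n)  # normalize by N^2


def psnr(original, distorted, peak=255):
    """Peak signal to noise ratio in dB; infinite for identical images."""
    err = mse(original, distorted)
    if err == 0:
        return math.inf  # assumption: identical images reported as infinity
    return 10.0 * math.log10(peak ** 2 / err)


original = [[10, 20], [30, 40]]
distorted = [[12, 20], [30, 36]]
print(mse(original, distorted))   # squared differences 4 and 16 over 4 pixels
print(psnr(original, distorted))
```

Both functions need only one pass over the pixels, which reflects the low computational complexity attributed to the first class approach.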
The advantage of the first class approach is that it is mathematically simple and low in computational complexity. For this reason, the first class approach is widely adopted.
The second class approach, however, aims at perception results which are closer to human vision, and hence leads to better accuracy in visual assessment and information processing. Nevertheless, due to an incomplete understanding of the HVS and a lag in incorporating physiological and/or psychological findings into models of the HVS, the performance of the second class approach is still not satisfactory.
There is physiological and psychological evidence that an observer who looks at an image or video does not pay attention to all visual information of the image or video, but only focuses on certain regions. Such visual attention information from the observer is used in HVS-based processing in many applications, e.g. for computation of a search process in visual perception, or to evaluate the quality of an image or video.
Visual attention may be implemented by either a bottom-up process or a top-down process. In the bottom-up process, visual attention is based on stimuli from visual features of the image/video, and a saliency map for the image/video is formed based on such stimuli. Examples of visual feature based stimuli include illumination, color, motion, shape, etc. In the top-down process, the saliency map for the image/video is formed based on prior/domain knowledge or indication from other known information like sound.
[1] discloses a method that combines three factors, namely loss of correlation, luminance distortion and contrast distortion, to measure distortion of an image.
[2] proposes a no-reference quality metric 100 as shown in FIG. 1. A distorted image/video 101 is received by an artifact extraction unit 102 to detect the distribution of blurring and blockiness of the image/video 101. Such distribution properties of blurring and blockiness are discriminated in a discrimination unit 103 to generate an output signal 104 representing the distortion value of the distorted image/video 101.
The methods according to [1] and [2] belong to the first class approach, and hence do not provide results as close to human perception as those of the second class approach.
[3] proposes a metric 200 based on video decomposition and spatial/temporal masking as shown in FIG. 2. A reference image/video 201 and a distorted image/video 202 are each received by a signal decomposition unit 203, 204. The respective decomposed signals 205, 206 are each received by a contrast gain control unit 207, 208 for spatial/temporal masking of the decomposed signals 205, 206. The respective processed signals 209, 210 are processed by a detection and pooling unit 211 to generate an output signal 212 representing the distortion value of the distorted image/video 202.
[4] uses a neural network to combine multiple visual features for measuring the quality of an image/video as shown in FIG. 3. A reference image/video 301 and a distorted image/video 302 are input to a plurality of feature extraction units 303 to extract various features of the image/video 301, 302. The extracted features 304 are received by a neural network 305 to generate the distortion value 306 of the distorted image/video 302.
[5] discloses a method for evaluating the perceptual quality of a video by assigning different weights to several visual stimuli.
The references [4] and [5] process the whole image or video equally, and hence are not computationally efficient, as insignificant portions of the image/video are also processed.
[6] uses several bottom-up visual stimuli to determine regions of high visual attention in an image/video. The features determined from these bottom-up visual stimuli are weighted and accumulated to form an Importance Map indicating the regions of high visual attention. This method does not result in very good quality assessment of the image/video as only bottom-up features are determined. Furthermore, high visual attention of a region does not always mean that the region should be coded with a high quality.
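The weighting and accumulation of bottom-up features into an importance map can be pictured with the following minimal Python sketch. It is not the method of [6]: the feature names, the weights, the nested-list representation and the normalization to the range [0, 1] are all assumptions made for illustration.

```python
def importance_map(feature_maps, weights):
    """Weighted sum of equally sized feature maps, scaled so the peak is 1.

    feature_maps: dict mapping a (hypothetical) feature name to a 2-D list.
    weights: dict mapping the same names to non-negative weights.
    """
    names = list(feature_maps)
    if not names:
        raise ValueError("at least one feature map is required")
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for name in names:
        w = weights[name]
        fmap = feature_maps[name]
        for r in range(rows):
            for c in range(cols):
                combined[r][c] += w * fmap[r][c]  # accumulate weighted feature
    peak = max(max(row) for row in combined)
    if peak > 0:
        combined = [[v / peak for v in row] for row in combined]
    return combined


# Hypothetical 2 x 2 feature maps with values in [0, 1].
features = {
    "contrast": [[0.0, 1.0], [0.0, 0.0]],
    "motion": [[0.0, 1.0], [1.0, 0.0]],
}
result = importance_map(features, {"contrast": 0.5, "motion": 0.5})
print(result)
```

In this toy input the top-right pixel is strong in both features and therefore receives the highest importance, while the bottom-left pixel, strong in only one feature, receives half of it.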
[7] discloses a method similar to [6], but uses both bottom-up and top-down visual stimuli to determine regions of high visual attention in the image/video. The determined features obtained from the bottom-up and top-down visual stimuli are integrated together using a Bayes network, wherein the Bayes network has to be trained prior to the integration. As mentioned, high visual attention of a region does not always mean that the region should be coded with a high quality. Moreover, the use of a Bayes network for integrating the features of the image/video is complex as the Bayes network needs to be trained prior to integrating the features.
Therefore, a more accurate and yet robust method of assessing the quality or distortion of an image or video is desired.