A communication system such as a VoIP system implemented over the Internet may be required to serve billions of calling minutes to people around the globe. Users now expect a high-quality calling experience. Meeting these expectations depends on the communication system provider's ability to define, analyze, measure, improve and monitor call quality. This in turn requires understanding how technical conditions (measured by technical parameters) affect the user's subjective call experience, and how frequently those conditions occur; for instance, an understanding of the network and media characteristics in categories such as transport quality, quality of service (QoS), quality of media (QoM) and quality of experience (QoE). Various methods are currently in use for objectively assessing media quality.
The simplest methods use basic engineering metrics such as the signal-to-noise ratio (SNR) commonly used in audio and the peak signal-to-noise ratio (PSNR) used in video. These simple metrics can also be modified to account better for perceived quality. For example, refinements that adapt PSNR to the spatio-temporal complexity of the video have been proposed, resulting in a higher correlation with human perception. An alternative to PSNR is the structural similarity index (SSIM), which has a higher correlation with subjective quality. Recent work in video coding has aimed at using SSIM as the encoding distortion measure.
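By way of illustration only, the two basic metrics named above can be sketched as follows. The sketch assumes the reference and degraded signals are equal-length sequences of samples (or pixel values) and that the peak value is 255, as for 8-bit video; the function names `snr_db` and `psnr_db` are hypothetical and are not taken from any standard or library.

```python
import math


def _squared_error(reference, degraded):
    """Sum of squared differences between two equal-length sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(reference, degraded))


def snr_db(reference, degraded):
    """Signal-to-noise ratio in dB: reference power over error power."""
    noise_power = _squared_error(reference, degraded)
    signal_power = sum(x * x for x in reference)
    if noise_power == 0:
        return math.inf  # identical signals: no measurable noise
    if signal_power == 0:
        return -math.inf  # silent reference with a non-zero error
    return 10.0 * math.log10(signal_power / noise_power)


def psnr_db(reference, degraded, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit samples."""
    mse = _squared_error(reference, degraded) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)
```

Both functions need the original signal as an input, and neither models human perception, which is why the perceptually motivated refinements described above have been proposed.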
The more advanced methods for audio and video quality assessment mimic the entire (and very complex) human hearing or visual system, and try to predict the mean user-perceived quality, measured for instance as a mean opinion score (MOS). Examples of the most advanced models today are the speech quality tool in ITU-T P.863 (POLQA), and the video quality tools in ITU-T J.247 and J.341.
The objective test methodologies can be divided into three groups based on the inputs provided to the models: full-reference, reduced-reference and no-reference models. This taxonomy reflects whether, and to what extent, a model uses the original audio or video signal as a reference for analysis.
In the full-reference models (such as the aforementioned PSNR, SSIM and POLQA) the original audio or video signal is compared to the processed (or so-called degraded) audio or video signal. Based on the comparison, the model predicts the user-perceived quality. The reduced-reference models use only a subset of the original signal's properties for quality assessment. Examples in this category of models include the standardized Video Quality Metric (VQM) designed for MPEG-2 quality assessment.
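The reduced-reference idea can be illustrated with a deliberately simple sketch. It assumes that the only properties of the original signal available to the model are a per-frame mean and standard deviation, and it reports an average absolute difference in those features; this is a toy example and is not the VQM algorithm or any standardized model. The names `extract_features` and `reduced_reference_distance` are hypothetical.

```python
import statistics


def extract_features(frame):
    """Summarize a frame (a flat list of pixel values) as (mean, std)."""
    return (statistics.fmean(frame), statistics.pstdev(frame))


def reduced_reference_distance(ref_features, degraded_frames):
    """Average feature difference between original and degraded frames.

    ref_features holds the (mean, std) pairs extracted from the original
    frames; the original pixel data itself is never needed here.
    """
    if not ref_features or len(ref_features) != len(degraded_frames):
        raise ValueError("need one reference feature pair per degraded frame")
    total = 0.0
    for (ref_mean, ref_std), frame in zip(ref_features, degraded_frames):
        mean, std = extract_features(frame)
        total += abs(mean - ref_mean) + abs(std - ref_std)
    return total / len(ref_features)
```

In such an arrangement only the small feature tuples would need to accompany the media, whereas a full-reference model requires access to the complete original signal.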
The no-reference models do not use the original audio or video signal to assess the quality. Instead, these models make assumptions about the properties of the original signal. Perhaps the best-known no-reference model is the E-model (ITU-T G.107) designed for speech quality assessment. There is a recent extension of the E-model (ITU-T G.1070) that combines video quality (coding, frame rate, packet loss and display resolution) with the joint audio and video quality (delay and synchronization) into a total quality score. The audio part in G.1070 is a simplified version of the G.107 model. Both of these models were designed to assist telecom operators in their network infrastructure design so as to guarantee a specific level of quality. The G.107 E-model has been extended from narrowband and wideband usage towards super-wideband usage, supporting modern speech codecs such as SILK. Further refinements of the G.1070 model also take into account the video content, for instance the spatio-temporal complexity.
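As an illustration of how a no-reference model yields a quality estimate, the E-model produces a transmission rating factor R computed from network and terminal parameters, which can then be mapped to an estimated MOS. The sketch below shows only that final mapping, using the conversion formula given for narrowband in G.107; the computation of R itself is omitted, and the function name `r_to_mos` is hypothetical.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (1.0 to 4.5)."""
    if r <= 0:
        return 1.0  # lower bound of the estimated MOS scale
    if r >= 100:
        return 4.5  # upper bound of the estimated MOS scale
    # Cubic mapping between the two bounds.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

For example, `r_to_mos(93.2)` evaluates to roughly 4.41; the value 93.2 is commonly cited as the rating obtained with the G.107 default parameter values and is used here only as an assumed input. No original signal is required at any point, because R is derived from parameters rather than from a signal comparison.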