Speech quality assessment is an important problem in mobile communications. The quality of a speech signal is a subjective measure. It can be expressed in terms of how natural the signal sounds or how much effort is required to understand the message. In a subjective test, speech is played to a group of listeners, who are asked to rate the quality of this speech signal, see [1], [2].
The most common measure of user opinion is the mean opinion score (MOS), obtained by averaging the absolute category ratings (ACR). In ACR, listeners compare the distorted signal with their internal model of high quality speech. In degradation MOS (DMOS) tests, the subjects listen to the original speech first, and are then asked to select the degradation category rating (DCR) corresponding to the distortion of the processed signal. DMOS tests are more common in audio quality assessment, see [3], [4].
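The averaging step that turns ACR votes into a MOS can be sketched as follows. This is a minimal illustration only: the five-point scale labels and the listener votes are assumptions introduced for the example and are not taken from the text above.

```python
from statistics import mean

# Assumed five-point ACR scale (5 = excellent ... 1 = bad); not stated in the text.
ACR_SCALE = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average a list of integer ACR votes into a single MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in ACR_SCALE:
            raise ValueError(f"rating {rating!r} is outside the assumed 1..5 scale")
    return mean(ratings)


# Hypothetical votes from six listeners for one processed speech sample.
votes = [4, 3, 5, 4, 4, 3]
mos = mean_opinion_score(votes)
```

A DMOS would be computed in the same way, with the votes taken from the degradation category scale instead.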
Assessment of the listening quality as described in [1]-[4] is not the only form of quality of service (QoS) monitoring. In many cases conversational subjective tests, see [2], are the preferred method of subjective evaluation, where participants hold conversations over a number of different networks and vote on their perception of conversational quality. An objective model of conversational quality can be found in [5]. Yet another class of QoS monitoring consists of intelligibility tests. The most popular intelligibility tests are the Diagnostic Rhyme Test (DRT) and Modified Rhyme Test (MRT), see [6].
Subjective tests are believed to give the “true” speech quality. However, the involvement of human listeners makes them expensive and time-consuming. Such tests can be used only in the final stages of developing a speech communication system and are not suitable for measuring QoS on a daily basis.
Objective tests use mathematical expressions to predict speech quality. Their low cost means that they can be used to continuously monitor the quality over the network. Two different test situations can be distinguished:
Intrusive, where both the original and distorted signals are available. This is illustrated in FIG. 1, where a reference signal is forwarded to a system under test, which distorts the reference signal. The distorted signal and the reference signal are both forwarded to an intrusive measurement unit 12, which estimates a quality measure for the distorted signal.
Non-intrusive (sometimes also denoted “single-ended” or “no-reference”), where only the distorted signal is available. This is illustrated in FIG. 2. In this case a non-intrusive measurement unit 14 estimates a quality measure directly from the distorted signal without access to the reference signal.
The simplest class of intrusive objective quality measures consists of waveform-comparison algorithms, such as signal-to-noise ratio (SNR) and segmental signal-to-noise ratio (SSNR). The waveform-comparison algorithms are simple to implement and have low computational complexity, but they do not correlate well with subjective measurements if different types of distortions are compared.
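The two waveform-comparison measures can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: the frame length of 160 samples and the clamping limits of -10 dB and 35 dB for the per-frame SNR are choices made for this example, and the function names are hypothetical.

```python
import math


def snr_db(reference, distorted):
    """Global SNR in dB: reference energy over error energy."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, distorted))
    if noise == 0:
        return math.inf   # identical signals
    if signal == 0:
        return -math.inf  # silent reference, non-zero error
    return 10.0 * math.log10(signal / noise)


def segmental_snr_db(reference, distorted, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs, each clamped to [floor_db, ceil_db].

    frame_len, floor_db and ceil_db are assumed values for this sketch.
    A trailing partial frame is ignored.
    """
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dis = distorted[start:start + frame_len]
        frame_snrs.append(min(max(snr_db(ref, dis), floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


# Hypothetical signals: one active frame followed by one silent frame.
reference = [1.0] * 160 + [0.0] * 160
distorted = [0.9] * 160 + [0.1] * 160
global_snr = snr_db(reference, distorted)
seg_snr = segmental_snr_db(reference, distorted)
```

In this example the global SNR is dominated by the active frame, while the segmental SNR also penalizes the error in the silent frame, which is why the two values differ.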
Frequency-domain techniques, such as the Itakura-Saito (IS) measure and the spectral distortion (SD) measure, are widely used. Frequency-domain techniques are not sensitive to a time shift and are generally more consistent with human perception, see [7].
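A minimal sketch of a spectral distortion measure is given below, assuming the root-mean-square log-spectral difference between two frames as the definition of SD; the exact formulation used in [7] is not reproduced here. The direct DFT, the `eps` guard and the test signal are assumptions made for the example. A circular shift is used to illustrate the insensitivity to time shift, since it leaves the magnitude spectrum unchanged.

```python
import cmath
import math


def power_spectrum(frame):
    """Power |X[k]|^2 for k = 0..N/2, using a direct O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        xk = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                 for i, x in enumerate(frame))
        spectrum.append(abs(xk) ** 2)
    return spectrum


def log_spectral_distortion_db(ref_frame, dist_frame, eps=1e-12):
    """RMS difference in dB between the two log power spectra.

    eps is an assumed guard against taking the logarithm of zero.
    """
    p_ref = power_spectrum(ref_frame)
    p_dist = power_spectrum(dist_frame)
    diffs = [10.0 * math.log10((a + eps) / (b + eps))
             for a, b in zip(p_ref, p_dist)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical 64-sample test frame.
frame = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(64)]
shifted = frame[5:] + frame[:5]          # circular time shift
attenuated = [0.5 * x for x in frame]    # 6 dB level change

sd_shifted = log_spectral_distortion_db(frame, shifted)        # close to 0 dB
sd_attenuated = log_spectral_distortion_db(frame, attenuated)  # close to 6.02 dB
```

A waveform-comparison measure applied to the same shifted frame would report a large error, even though the two frames have the same magnitude spectrum.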
A significant number of intrusive perceptual-domain measures have been developed. These measures incorporate knowledge of the human perceptual system. Mimicry of human perception is used for dimension reduction and a “cognitive” stage is used to perform the mapping to a quality scale. The cognitive stage is trained by means of one or more databases. These measures include the Bark Spectral Distortion (BSD), see [8], Perceptual Speech Quality Measure (PSQM), see [9], and Measuring Normalizing Blocks (MNB), see [10], [11]. Perceptual evaluation of speech quality (PESQ), see [12], and perceptual evaluation of audio quality (PEAQ), see [13], are standardized state-of-the-art algorithms for intrusive quality assessment of speech and audio, respectively.
Existing intrusive objective speech quality measures may automatically assess the performance of the communication system without the need for human listeners. However, intrusive measures require access to the original signal, which is typically not available in QoS monitoring. For such applications non-intrusive quality assessment must be used. These methods often include mimicry of human perception and/or a mapping to the quality measure that is trained using databases.
An early attempt at a non-intrusive speech quality measure based on a spectrogram of the perceived signal is presented in [14]. The spectrogram is partitioned, and variance and dynamic range are calculated on a block-by-block basis. The average level of variance and dynamic range is used to predict speech quality.
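The block-by-block feature computation can be sketched as follows. This only illustrates the general idea: the block size, the use of non-overlapping blocks, the dB-valued input and the function name are assumptions for the example, and neither the partitioning actually used in [14] nor its mapping from the averaged features to a quality value is shown.

```python
from statistics import mean, pvariance


def block_statistics(spectrogram_db, block_frames=4, block_bins=4):
    """Average per-block variance and dynamic range of a spectrogram.

    spectrogram_db is a list of time frames, each a list of dB values per
    frequency bin. Non-overlapping blocks of block_frames x block_bins are
    used (assumed sizes); incomplete blocks at the edges are ignored.
    """
    n_frames = len(spectrogram_db)
    n_bins = len(spectrogram_db[0]) if n_frames else 0
    variances, dynamic_ranges = [], []
    for t0 in range(0, n_frames - block_frames + 1, block_frames):
        for f0 in range(0, n_bins - block_bins + 1, block_bins):
            block = [spectrogram_db[t][f]
                     for t in range(t0, t0 + block_frames)
                     for f in range(f0, f0 + block_bins)]
            variances.append(pvariance(block))
            dynamic_ranges.append(max(block) - min(block))
    if not variances:
        raise ValueError("spectrogram is smaller than one block")
    return mean(variances), mean(dynamic_ranges)


# Hypothetical 4 x 4 spectrogram: two quiet frames followed by two louder ones.
spectrogram = [[0.0] * 4, [0.0] * 4, [10.0] * 4, [10.0] * 4]
avg_variance, avg_dynamic_range = block_statistics(spectrogram)
```

The two averaged values would then serve as inputs to whatever quality predictor is trained on top of them.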
The non-intrusive speech quality assessment reported in [15] attempts to predict the likelihood that the passing audio stream is generated by the human vocal production system. The speech stream under assessment is reduced to a set of features. The parameterized data is used to estimate the perceived quality by means of physiologically based rules.
The measure proposed in [16] is based on comparing the output speech to an artificial reference signal that is appropriately selected from an optimally clustered codebook. Perceptual Linear Prediction (PLP) coefficients, see [17], are used as a parametric representation of the speech signal. A fifth-order all-pole model is applied to suppress speaker-dependent details of the auditory spectrum. The average distance between the unknown test vector and the nearest reference centroids provides an indication of speech degradation.
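The final distance computation can be sketched as below. This is a simplified illustration: the feature vectors stand in for PLP coefficients, the Euclidean distance is an assumed choice of metric, and the codebook is taken as given, so neither the PLP analysis nor the clustering of [16] is shown. All names are hypothetical.

```python
import math


def nearest_centroid_distance(vector, codebook):
    """Euclidean distance from one feature vector to its closest centroid."""
    return min(math.dist(vector, centroid) for centroid in codebook)


def average_codebook_distance(feature_vectors, codebook):
    """Mean nearest-centroid distance over all frames of a test signal."""
    if not feature_vectors or not codebook:
        raise ValueError("feature vectors and codebook must be non-empty")
    distances = [nearest_centroid_distance(v, codebook) for v in feature_vectors]
    return sum(distances) / len(distances)


# Hypothetical two-dimensional codebook and per-frame feature vectors.
codebook = [(0.0, 0.0), (10.0, 0.0)]
test_vectors = [(3.0, 4.0), (10.0, 1.0)]
degradation = average_codebook_distance(test_vectors, codebook)
```

A larger average distance indicates that the test signal lies further from the clean-speech reference centroids, which is taken as a sign of stronger degradation.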
Recent algorithms based on Gaussian-mixture probability models (GMM) of features derived from perceptually motivated spectral-envelope representations can be found in [18] and [19]. A novel, perceptually motivated speech quality assessment algorithm based on temporal envelope representation of speech is presented in [20] and [21].
The International Telecommunication Union (ITU) standard for non-intrusive quality assessment, ITU-T P.563, can be found in [22]. A total of 51 speech features are extracted from the signal. Key features are used to determine a dominant distortion class, and in each distortion class a linear combination of features is used to predict a so-called intermediate speech quality. The final speech quality is estimated from the intermediate quality and 11 additional features.
The measures listed above are designed to predict the effects of many types of distortions, and typically have high computational complexity. Such algorithms will be referred to as general speech quality predictors. It has been shown that non-intrusive quality prediction is possible at much lower complexity if it is assumed that the type of distortion is known, see [23]. However, the latter class of measures is likely to suffer from poor prediction performance if the expected working conditions are not met.