For the planning, design, installation, optimization, and monitoring of telecommunication networks providing speech transmission capabilities, the quality experienced by the user of the related service is taken into account. Quality is usually quantified by carrying out perceptual experiments with human subjects in a laboratory environment. For assessing the quality of transmitted speech, test subjects are put into either a listening-only or a conversational situation, experience speech samples under these conditions, and rate the quality of what they have heard on a number of rating scales. The Telecommunication Standardization Sector of the International Telecommunication Union provides guidelines for such experiments and proposes a number of rating scales to be used, as described for instance in ITU-T Rec. P.800, 1996, ITU-T Rec. P.830, 1996, or in the ITU-T Handbook on Telephonometry, 1992. The most frequently used scale is a 5-point absolute category rating scale for “overall quality”. The averaged score of the subjective judgments obtained on this scale is called a Mean Opinion Score, MOS. MOS values can be qualified as to whether they have been obtained in a listening-only or conversational situation, and in the context of narrow-band (300-3400 Hz audio bandwidth), wideband (50-7000 Hz) or mixed (narrow-band and wideband) transmission channels, as is described for instance in ITU-T Rec. P.800.1 (2006).
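The averaging step that yields a MOS can be illustrated with a minimal sketch. The function name and the ratings below are hypothetical and serve only as an illustration; the sketch assumes integer ratings on a 5-point scale from 1 to 5.

```python
def mean_opinion_score(ratings):
    """Average a list of 5-point category ratings (assumed integers 1..5)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from eight test subjects for one test condition.
ratings = [4, 3, 5, 4, 4, 3, 4, 5]
mos = mean_opinion_score(ratings)
```

In practice one such average is computed per test condition, and the resulting value is labeled with the test situation and bandwidth context in which it was obtained.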
Because of the effort and cost required to run subjective tests, algorithms have been developed which estimate the subjective rating to be expected in a perceptual experiment on the basis of speech signals, or of parameters characterizing the telecommunication network. Speech signals can be generated artificially, for instance by using simulations, or they can be recorded in operating networks. Depending on whether speech signals at the input of the transmission channel under consideration are available or not, different types of signal-based models can be distinguished:

a full-reference model, which estimates subjective listening-quality scores by calculating a distance or similarity between adequate representations of the input and the output signal, or by deriving a distortion measure from the comparison of input and output signals, and transforming the result onto a scale related to subjective quality,

a no-reference model, which estimates subjective listening-quality scores on the basis of the output signal alone; this can be done e.g. by generating an artificial reference within the algorithm, and performing a subsequent signal-comparison analysis, as stated above, and

a conversational quality model, which estimates quality scores for a listening-only, a talking-only, and/or a conversational situation.
Several forms of full-reference models exist for speech and audio transmission channels. They usually consist of a pre-processing step for the input and the output signals, a transformation into an internal representation, a comparison step resulting in an index, followed by integration and transformation steps resulting in an estimated quality score.
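The sequence of steps named above can be sketched as follows. This is a toy illustration under stated assumptions, not an implementation of any of the standardized models: the internal representation is a simple per-frame log energy rather than a perceptual transform, the signals are assumed to be time-aligned, and the function names, the frame length of 160 samples, and the exponential mapping to a 1-to-5 scale are all hypothetical choices.

```python
import math

def normalize_level(signal):
    """Pre-processing: scale to unit RMS so overall level differences are ignored."""
    if not signal:
        raise ValueError("signal must not be empty")
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return [x / rms for x in signal] if rms > 0 else list(signal)

def frame_log_energies(signal, frame_len=160):
    """Internal representation (toy stand-in): log energy of each full frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [math.log10(sum(x * x for x in f) / frame_len + 1e-10) for f in frames]

def estimate_quality(reference, degraded, frame_len=160):
    """Estimate a score in [1, 5] from a time-aligned reference/degraded pair."""
    ref = frame_log_energies(normalize_level(reference), frame_len)
    deg = frame_log_energies(normalize_level(degraded), frame_len)
    n = min(len(ref), len(deg))
    if n == 0:
        raise ValueError("signals are shorter than one frame")
    # Comparison: per-frame distance between the two representations.
    distances = [abs(r - d) for r, d in zip(ref, deg)]
    # Integration: average the per-frame distances over time.
    mean_distance = sum(distances) / n
    # Transformation: hypothetical monotone mapping onto a 1..5 scale.
    return 1.0 + 4.0 * math.exp(-mean_distance)

# Hypothetical test signals: a 440 Hz tone sampled at 8 kHz, and a copy
# whose second half is muted to mimic a long dropout.
reference = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1600)]
degraded = reference[:800] + [0.0] * 800
score_clean = estimate_quality(reference, reference)
score_degraded = estimate_quality(reference, degraded)
```

An undistorted channel yields zero distance and therefore the top of the scale, while the muted copy is pushed toward the bottom. Real models differ mainly in how the internal representation and the comparison step are designed.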
For narrow-band speech transmission, full-reference models include the PESQ model described in ITU-T Recommendation P.862 (2001), its precursor PSQM described in ITU-T Recommendation P.861 (1998), the TOSQA model described in ITU-T Contribution Com 12-19 (2001), as well as PAMS described in “The Perceptual Analysis Measurement System for Robust End-to-end Speech Quality Assessment” by A. W. Rix and M. P. Hollier, Proc. IEEE ICASSP, 2000, vol. 3, pp. 1515-1518. Further models are described in “Objective Modelling of Speech Quality with a Psychoacoustically Validated Auditory Model” by M. Hansen and B. Kollmeier, 2000, J. Audio Eng. Soc., vol. 48, pp. 395-409, “Objective Estimation of Perceived Speech Quality—Part I: Development of the Measuring Normalizing Block Technique” by S. Voran, IEEE Trans. Speech Audio Process., 1999, vol. 7, no. 4, pp. 371-382, “Instrumentelle Verfahren zur Sprachqualitätsschätzung—Modelle auditiver Tests” by J. Berger, 1998, PhD thesis, University of Kiel, Shaker Verlag, Aachen, “Psychoakustisch motivierte Maße zur instrumentellen Sprachgütebeurteilung” by M. Hauenstein, 1997, PhD thesis, University of Kiel, Shaker Verlag, Aachen, and “An objective Measure for Predicting Subjective Quality of Speech Coders” by S. Wang, A. Sekey and A. Gersho, 1992, IEEE J. Sel. Areas Commun., vol. 10, no. 5, pp. 819-829.
The model by Wang, Sekey and Gersho uses a Bark Spectral Distortion (BSD) measure, which does not account for masking effects.
The PSQM model (Perceptual Speech Quality Measure) is derived from the PAQM model (Perceptual Audio Quality Measure) and was specialized for the evaluation of speech quality only. As new cognitive effects, PSQM includes a measure of noise disturbance in silent intervals and an asymmetry of perceptual distortion between components left out or introduced by the transmission channel. The model by Voran, called Measuring Normalizing Block, uses an auditory distance between the two perceptually transformed signals. The model by Hansen and Kollmeier uses a correlation coefficient between the two speech signals after transformation to a higher neural stage of perception. The PAMS (Perceptual Analysis Measurement System) model is an extension of the BSD measure including new elements to rule out effects due to variable delay in Voice-over-IP systems and linear filtering in analogue interfaces. The TOSQA model (Telecommunication Objective Speech Quality Assessment; Berger, 1998) assesses an end-to-end transmission channel, including terminals, using a measure of similarity between both perceptually transformed signals. The PESQ (Perceptual Evaluation of Speech Quality) model is a combination of two precursor models, PSQM and PAMS, and includes partial frequency response equalization.
For wideband (50-7000 Hz) or mixed narrow-band and wideband speech transmission channels, only a few proposals have been made. The ITU-T currently recommends an extension of its PESQ model in Rec. P.862.2 (2005), called wideband PESQ, WB-PESQ, which mainly consists in replacing the input filter characteristics of PESQ by a high-pass filter, and applying it to both narrow-band and wideband speech signals. In addition, the 2001 version of TOSQA (ITU-T Contr. COM 12-19, 2001) has been shown to be able to estimate MOS also in a wideband context, as has WB-PAMS (ITU-T Del. Contr. D.001, 2001).
Several studies in the literature evaluate the consistency of WB-PESQ estimations with subjective judgments, for instance ITU-T Del. Contr. D.070 (2005), “Objective Quality Assessment of Wideband Speech by an Extension of the ITU-T Recommendation P.862” by A. Takahashi et al., 2005, in Proc. 9th Int. Conf. on Speech Communication and Technology (Interspeech Lisboa 2005), Lisbon, pp. 3153-3156, “Objective Quality Assessment of Wideband Speech Coding” by N. Kitawaki et al., 2005, in IEICE Trans. on Commun., vol. E88-B(3), pp. 1111-1118, or “Analysis of a Quality Prediction Model for Wideband Speech Quality, the WB-PESQ” by N. Côté et al., 2006, in Proc. 2nd ISCA Tutorial and Research Workshop on Perceptual Quality of Systems, Berlin, pp. 115-122.
The evaluation procedure usually consists in analyzing the relationship between auditory judgments obtained in a listening-only test, MOS_LQS (MOS Listening Quality Subjective), and their corresponding instrumentally-estimated MOS_LQO (MOS Listening Quality Objective) scores. For example, in Takahashi et al. (2005), three wideband speech codecs were evaluated with WB-PESQ, and a bias was found for the G.722.1 codec, in that MOS_LQO is significantly lower than MOS_LQS. The same effect was observed in Kitawaki et al. (2005) for the G.722.2 codec, although the average correlation coefficient is about 0.90. WB-PESQ was shown to be able to predict the codec ranking in the listeners' judgments, but was not able to quantify the perceptual difference between the codecs.
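The distinction between correlation and bias drawn above can be made concrete with a short sketch. The score values are hypothetical and chosen only to show that a model can track the ranking of conditions almost perfectly while still underestimating every subjective score; the function names are likewise assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)

def mean_bias(subjective, objective):
    """Mean of (objective - subjective); negative means underestimation."""
    if len(subjective) != len(objective) or not subjective:
        raise ValueError("need two non-empty lists of equal length")
    return sum(o - s for s, o in zip(subjective, objective)) / len(subjective)

# Hypothetical per-condition scores: the estimates sit 0.5 below the
# subjective scores for every condition.
mos_lqs = [4.2, 3.8, 3.1, 2.5, 1.9]
mos_lqo = [3.7, 3.3, 2.6, 2.0, 1.4]
r = pearson(mos_lqs, mos_lqo)
bias = mean_bias(mos_lqs, mos_lqo)
```

With these values the correlation is essentially 1 while the bias is about -0.5, which is why a high correlation coefficient alone does not show that the perceptual difference between codecs is quantified correctly.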
The following table shows Pearson correlation coefficients obtained on the AQUAVIT database (Assessment of Quality for Audio-Visual Signals over Internet and UMTS, Eurescom Project P.905, March 2001) for three wideband models:
Test   Bandwidth     WB-PESQ   TOSQA-2001   WB-PAMS
1      Mixed Band    0.952     0.966        0.946
2a     Narrow Band   0.981     0.954        0.981
2b     Wide Band     0.977     0.982        0.992
As can be seen from these data, the known models already provide estimated quality scores that correlate highly with subjective judgments. However, the models typically do not achieve the same accuracy for narrow-band and wideband transmitted speech. Furthermore, if poor quality of a transmission path is detected, the estimated quality score provides no information on the source of the quality loss.