This invention relates to non-intrusive speech-quality assessment using vocal-tract models, in particular for testing telecommunications systems and equipment.
Customers are now able to choose a telecommunications service provider based upon price and quality of service. The decision is no longer fixed by monopolies or restricted by limited technology. A range of services is available with differing costs and quality of service. Service providers need the capability to predict customers' perceptions of quality so that networks can be optimized and maintained. Traditionally, networks have been characterized using linear assessment techniques, tone-based signals, and simple engineering metrics such as signal-to-noise ratio. As networks become more complex, including non-linear elements such as echo cancellers and compressive speech coders, there is a requirement for assessment systems which bear a closer relationship to the human perception of signal quality. This role has typically been filled by expensive and time-consuming subjective tests using human subjects. Such tests are employed for commissioning new network elements, during the design of new coding algorithms, and for testing different network topologies.
Recent advances in perceptual modeling have led to the construction of objective auditory models, which can generate predictions of perceived telephony speech quality from a listener's perspective. These assessment techniques require a known test stimulus to excite a network connection and then use a perceptually-motivated comparison between a reference version of the known test stimulus, and a version of the same stimulus as degraded by the system under test, to provide a measure of the quality of the degraded version as it would be perceived by a human listener.
FIG. 1 shows the principle of the BT Laboratories Perceptual Analysis Measurement System (PAMS), disclosed in International Patent Applications WO94/00922, WO95/01011, and WO95/15035. In this system the reference signal 11 comprises a speech-like test stimulus which is used to excite the connection under test 10 to generate a degraded signal 12. The two signals are then compared in the analysis process 1 to generate an output 18 indicative of the subjective impact of the degradation of the signal 12 when compared with the reference signal 11.
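By way of illustration only, the general idea of comparing a reference signal with its degraded counterpart can be sketched using a simple engineering metric, the signal-to-noise ratio. This is not the perceptual analysis performed by PAMS, which the present text does not detail; the function name `snr_db` and the sample values are assumptions made for this sketch, and the two signals are assumed to be already time-aligned and of equal length.

```python
import math


def snr_db(reference, degraded):
    """Return the signal-to-noise ratio in dB between two aligned signals.

    Hypothetical helper for illustration: the 'noise' is taken to be the
    sample-by-sample difference between the degraded and reference signals.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must have the same number of samples")
    if not reference:
        raise ValueError("signals must not be empty")

    # Energy of the reference signal.
    signal_power = sum(r * r for r in reference)
    # Energy of the error introduced by the connection under test.
    noise_power = sum((d - r) ** 2 for r, d in zip(reference, degraded))

    if signal_power == 0.0:
        raise ValueError("reference signal has zero energy")
    if noise_power == 0.0:
        return math.inf  # degraded signal is identical to the reference

    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    degraded = [1.1, -1.1, 1.1, -1.1]
    print(f"SNR: {snr_db(reference, degraded):.2f} dB")
```

A metric of this kind treats all differences between the two signals equally, whereas a perceptually-motivated comparison weights them according to their likely audibility to a human listener.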
Such assessment techniques are known as “intrusive” because they require the withdrawal of the connection under test 10 from normal service so that it can be excited with a known test stimulus 11. Removing a connection from normal service renders it unavailable to customers and is expensive to the service provider. In addition, the conditions that generate distortions and errors could be due to network loading levels that are only present at peak times. An out-of-hours assessment could therefore generate artificial quality scores. This means that reliable intrusive testing is relatively expensive in terms of capacity on a customer's network connection.
In general, it would be preferable to continuously monitor the quality of speech at a particular point in the network. In this case, a “non-intrusive” solution is attractive, utilizing the in-service signal to make predictions of quality. Given this information, network traffic can be re-routed through less congested parts of the network if quality drops. A fundamentally different approach is required to analyse a degraded speech signal without a reference signal. The entire process takes place “downstream” of the equipment under test. Non-intrusive techniques are discussed in International Patent Specifications WO96/06495 and WO96/06496. Current non-intrusive assessment equipment performs measurements such as echo, delay, noise and loudness in an attempt to predict the clarity of a connection. However, a customer's perception of speech quality is also affected by distortions and irregularities in the speech structure, which are not described by such simple measures.
International Patent Specification WO97/05730 (now also U.S. Pat. No. 6,035,270) describes a system of this general type which aims to generate an output indicative of how plausible it is that the passing audio stream was generated by the human vocal production system. This is achieved by comparing the audio stream with a spectral model representative of the sounds capable of production by the human vocal system. This process requires pattern recognition to distinguish the spectral characteristics representative of speech and of distortion, so that their presence can be identified.
These analysis processes use spectral models, although physiological models have previously been used for speech synthesis; see for example the use of each type of model for these respective purposes in International Patent Specifications WO96/06496 and WO97/00432. Unlike a physiological model, spectral models are empirical, and have no intrinsic basis on which to identify what sounds the vocal tract is capable of producing. However, the physiological articulatory models used in the synthesis of continuous speech utilize constraints to ensure the generated speech is smooth and natural sounding. These models would therefore be unsuitable for an assessment process, since in such a process the parameters generated must also be capable of representing “illegal” vocal-tract shapes that the constraints used by such a synthesis model would ordinarily remove. It is the regions that are in error or distorted that contain the information for such an assessment; to remove this at the parameterization stage would make a subsequent analysis of their properties redundant.