This invention relates to the measurement of quality of a sound signal, and more specifically a speech signal. Objective processes for this purpose are currently under development and are of application in prototype testing, pre-delivery testing of components, and in-service testing of installed equipment. They are most commonly used in telephony, but are also of application in other systems used for carrying speech signals, for example public-address systems.
The present applicant has a number of patents and applications relating to this technical field, most particularly European Patent 0647375, granted on Oct. 14th 1998. In this system, a signal degraded by the system under test is compared with a reference signal, which has not passed through the system under test, to identify audible errors in the degraded signal. These audible errors are assessed to determine their perceptual significance; that is, errors of types which are considered significant by human listeners are given greater weight than are those which are not considered so significant. Since only audible errors are assessed, inaudible errors, which are perceptually irrelevant, are not assessed.
The automated system provides an output comparable to subjective quality measures originally devised for use by human subjects. More specifically, it generates two values, YLE and YLQ, equivalent to the "Mean Opinion Scores" (MOS) for "listening effort" and "listening quality", which would be given by a panel of human listeners when listening to the same signal, as will be discussed later. The use of an automated system allows for more consistent assessment than human assessors could achieve, and also allows the use of compressed and simplified test sequences, and multilingual test sequences, which give spurious results when used with human assessors because such sequences do not convey intelligible content.
Such automated systems require a known (reference) signal to be played through a distorting system (the telephone network) to derive a degraded signal, which is compared with an undistorted version of the reference signal. Such systems are known as "intrusive" measurement systems, because whilst the test is carried out the system under test cannot carry live (revenue-earning) traffic.
An auditory transform of each signal is taken, to emulate the response of the human auditory system (ear and brain) to sound. The degraded signal is then compared with the reference signal in the perceptual domain, in which the subjective quality that would be perceived by a listener using the network is determined from parameters extracted from the transforms.
A suitable test signal is disclosed in International Patent Specification WO/95/01011 (EP0705501) and comprises a sequence of speech-like sounds, selected to be representative of the different types of phonetic sounds that the system under test may have to handle, presented in a sequence. The sounds are selected such that typical transitions between individual phonetic elements are represented. Typical speech comprises a sequence of utterances separated by silent periods, as the speaker pauses to breathe, or listens to the other party to the conversation. These silent periods, and the transitions between utterances and silent periods, are also modelled by the test signal.
This existing system reliably assesses most speech carrier technologies employed within conventional analogue and digital switched telephone networks. In such networks a dedicated connection is provided between the two parties to a call, for the duration of that call, and all speech is carried over that connection. However, connectionless packet-based speech transmission systems are beginning to be introduced, in particular for use in the "Internet" and companies' internal "Intranets". In a connectionless packet-based system each transmission is divided into a series of data "packets", which travel independently from one user to the other. Intermediate nodes in the network transmit the packets to each other according to address information carried in each packet. However, according to the demands of other traffic on the various links between such nodes, and the available capacity on those links, different packets may be delayed, or may travel by different routes to reach the same destination. Consequently, end-to-end times vary from one packet to another. For the transmission of data such as text, or downloading of computer files for the recipient to use subsequently, such variations in end-to-end times are of little consequence. However, when used for real-time speech, these variations can affect the clarity of the speech as perceived by the user.
Various proposals have been made to try to minimise the delay to a level which does not interfere with conversation and comprehension; see for example the present applicant's International Patent Application WO99/12329, and the article by R Barnett in "Electronics and Communication Engineering Journal", October 1997, entitled "Connectionless ATM". However, it is fundamental to such connectionless systems that some variation in the residual delay will occur. A single speech utterance is typically assembled from the information carried in several packets. Variations in the delay between individual packets will in general not be apparent in the resulting utterance, as the slowest packet generally determines the delay to the utterance as a whole. The delay to each complete utterance can nevertheless vary considerably between one utterance and the next, as buffer lengths are normally adjusted during periods of silence.
Changes to the delay occurring during the course of an utterance, for example because part of the utterance is missing, will be more apparent in the resulting utterance.
In addition to changes in residual delay, transmission systems are now beginning to come into use in which changes in other characteristics, such as level (signal amplitude), can occur. See ITU-T draft recommendation G.169.
The human brain is insensitive to small changes in delay and amplitude between speech events, so these variations may be imperceptible to a human listener, provided the magnitude of the effect is not such as to interfere with conversation. However, the prior art measuring system is sensitive to such variations, so that it returns unreliable values for signal quality when testing connectionless packet systems; that is, the results do not accurately reflect the subjective quality reported by human subjects.
If the delay is constant, the two signals can easily be synchronised to take account of the delay. However, if the degraded signal suffers variable delay, at least some parts of the degraded signal would not be synchronised with the test signal. The lack of synchronisation in those parts would be detected as substantial errors, which would be so great as to mask any errors caused by actual degradation of the signal. This would lead to an inaccurate measure of the subjective effect of the degradation.
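The constant-delay case can be illustrated by a minimal sketch. It assumes the two signals are available as lists of samples and that the delay is estimated by maximising a plain cross-correlation; the function names `estimate_delay` and `align` and the parameter `max_lag` are hypothetical and are not taken from the prior art system.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which degraded best matches reference.

    Assumption: a plain cross-correlation peak is used as the match
    criterion; the source does not state how the delay is found.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += r * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(degraded, lag, length):
    """Undo a constant delay of `lag` samples, zero-padding to `length`."""
    aligned = []
    for i in range(length):
        j = i + lag
        aligned.append(degraded[j] if 0 <= j < len(degraded) else 0.0)
    return aligned


reference = [0, 0, 1, 2, -1, 0, 0, 0]
degraded = [0, 0, 0] + reference          # same signal, delayed by 3 samples
lag = estimate_delay(reference, degraded, max_lag=5)
aligned = align(degraded, lag, len(reference))
```

A single lag of this kind can only compensate a delay that holds for the whole signal; if the delay varies, no one value of `lag` aligns every part of the degraded signal.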
There is therefore a requirement for a measurement system that is robust against such variable delays.
According to the invention, there is provided apparatus for testing equipment for handling speech signals, comprising
means for receiving first and second signals,
means for selecting individual sections in the first signal and second signal,
means for comparing each section in the second signal with the corresponding section in the first signal to generate a distortion perception measure which indicates the extent to which the distortion of said section would be perceptible to a human listener, and
means for combining the results of each such measurement to generate an overall measure of the extent to which the distortion of the second signal with respect to the first signal would be perceptible to a human listener.
Preferably, the overall measure takes account of the perceptual importance of each section. The perceptual importance of a given section will depend on the number of individual speech components, and their relative importance to subjective quality measures, in that section.
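One way of combining per-section results while taking account of perceptual importance is sketched below. The weighted mean is an assumption made for illustration only; the text does not specify the combining rule, and the names `overall_distortion`, `section_scores` and `section_weights` are hypothetical.

```python
def overall_distortion(section_scores, section_weights):
    """Combine per-section distortion measures into one overall measure.

    Assumption: a weighted arithmetic mean, with each weight standing for
    the perceptual importance of the corresponding section.
    """
    if len(section_scores) != len(section_weights):
        raise ValueError("one weight is required per section")
    total_weight = sum(section_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(section_scores, section_weights))
    return weighted / total_weight


# Two sections: the second is three times as important as the first.
overall = overall_distortion([1.0, 3.0], [1.0, 3.0])
```

With equal weights the sketch reduces to a simple average of the section measures.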
The means for selecting individual sections in the two signals may comprise means for identifying individual utterances. In the preferred embodiment this is achieved by detecting the end of each silent period in the signal. The apparatus preferably includes means for synchronising each section in the distorted signal with the corresponding section in the test signal. Synchronisation is preferably carried out by analysis of the speech content of the signals. However, a separate synchronisation characteristic may be used to identify the onset of each section. This synchronisation characteristic is preferably outside the frequency band characteristic of speech, so that it does not interfere with the analysis process (which only detects changes perceptible to a human listener). The synchronisation characteristic related to a given section may be chosen to be unique to that section, to ensure that each distorted section is compared with the corresponding test section. This ensures that, should a section, or its synchronisation characteristic, be lost as a result of the distortion, subsequent sections can nevertheless be analysed.
In a preferred arrangement, each section is analysed to identify the position of any delay change, and the parts of the section preceding and following any such delay change are separately synchronised, and analysed for degradation.
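A minimal sketch of locating a delay change within a section follows. It assumes the section is examined in fixed-length frames and that each frame's delay is found from a cross-correlation peak; this technique, and the names `frame_delay`, `find_delay_change`, `frame_len` and `max_lag`, are illustrative assumptions rather than the method of the embodiment.

```python
def frame_delay(reference, degraded, start, length, max_lag):
    """Lag that maximises the correlation of one reference frame with degraded."""
    best_lag, best_score = 0, float("-inf")
    stop = min(start + length, len(reference))
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(start, stop):
            j = i + lag
            if 0 <= j < len(degraded):
                score += reference[i] * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def find_delay_change(reference, degraded, frame_len, max_lag):
    """Return (frame_start, old_lag, new_lag) for the first change, else None.

    Assumption: the position is resolved only to the nearest frame boundary.
    """
    previous = None
    for start in range(0, len(reference), frame_len):
        lag = frame_delay(reference, degraded, start, frame_len, max_lag)
        if previous is not None and lag != previous:
            return start, previous, lag
        previous = lag
    return None


reference = [1, 2, -1, 3, 2, -3, 1, 4]
# First half delayed by 1 sample, second half by 2 samples.
degraded = [0, 1, 2, -1, 3, 0, 2, -3, 1, 4]
change = find_delay_change(reference, degraded, frame_len=4, max_lag=3)
```

The parts before and after the reported position could then each be aligned with their own lag before being analysed for degradation. Frames containing no signal energy give no meaningful correlation peak, so this sketch is only suited to frames that contain speech.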
In the embodiment to be described in detail, the sections which are selected for analysis may comprise individual utterances, that is, unbroken sections of speech each preceded and followed by silence of a minimum pre-determined length. However, a number of alternative methods may be used for defining suitable sections. For example, long utterances as previously defined may be sub-divided into two or more sub-utterances. The signal may instead be broken into a number of sections of fixed length or a fixed number of equal-length sections. However, if any sections contain no speech at all, they are preferably not used for analysis, as delay is harder to determine in such sections. Any errors in sections containing no information are in any case less likely to be perceptually important.
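Selection of utterances bounded by silence of a minimum length can be sketched as follows. The amplitude threshold used to decide what counts as silence is an assumption made for illustration, and the names `find_utterances`, `threshold` and `min_silence` are hypothetical.

```python
def find_utterances(samples, threshold, min_silence):
    """Return (start, end) sample index pairs, end exclusive, one per utterance.

    Assumption: a sample is "silent" when its magnitude does not exceed
    `threshold`; an utterance ends once `min_silence` consecutive silent
    samples have followed its last active sample.
    """
    utterances = []
    start = None        # index where the current utterance began
    last_active = None  # index of the most recent non-silent sample
    for i, x in enumerate(samples):
        if abs(x) > threshold:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active >= min_silence:
            utterances.append((start, last_active + 1))
            start = None
    if start is not None:
        # The signal ended before the closing silence was long enough.
        utterances.append((start, last_active + 1))
    return utterances


samples = [0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 2, 0]
sections = find_utterances(samples, threshold=0.5, min_silence=3)
```

In this sketch the single silent sample at index 4 is shorter than `min_silence`, so it does not split the first utterance, whereas the three silent samples at indices 6 to 8 do separate the two utterances.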
In another aspect, the invention comprises a method of testing equipment for handling speech signals, comprising the steps of
supplying a test signal,
receiving a distorted signal which corresponds to said test signal when distorted by equipment under test,
selecting individual sections in the test signal and distorted signal,
comparing each section in the distorted signal with the corresponding section in the test signal to generate a distortion perception measure which indicates the extent to which the distortion of said section would be perceptible to a human listener, and
combining the results of each such comparison to generate an overall measure of the extent to which the distortion of the signal would be perceptible to a human listener.
The invention may be embodied in computer software, as a computer program product for loading directly into the internal memory of a digital computer, comprising software code portions for performing the steps of the method described above when said product is run on a computer.
In a further aspect the invention comprises a computer program product stored on a computer usable medium, comprising:
computer-readable program means for causing a computer to select individual sections in a first signal and in a second signal,
computer-readable program means for causing the computer to analyse each section in the first signal and the second signal to generate a distortion perception measure which indicates the extent to which the distortion of said section in the second signal as compared with the first signal would be perceptible to a human listener, and
computer-readable program means for causing the computer to combine the results of each such comparison to generate an overall measure of the extent to which the distortion of the signal would be perceptible to a human listener.
The computer program product may be embodied on any suitable carrier readable by a suitable computer input device, such as CD-ROM, optically readable marks, magnetic media, punched card, or on an electromagnetic or optical signal.