1. Field of the Invention
The present invention relates to auditory tests for evaluating the quality of encoded voice (speech) and audio signals, or for evaluating the quality of a telephone connection, for example a wired or wireless telephone connection. In particular, the present invention relates to the provisioning of test signal sections for performing so-called subjective and/or objective measurements for quality assessment.
2. Description of the Related Art
For evaluating the quality of encoded voice and audio signals, measurement technology today uses standardized perception-based measurement methods (perceptual measurements). One known method is the so-called PESQ method (PESQ = perceptual evaluation of speech quality) described in the standardization document ITU-T P.862 (02/2001). Another known measurement method for quality assessment is the so-called PEAQ method (PEAQ = perceptual evaluation of audio quality), a method for objective measurements of perceived audio quality described in the standardization document Rec. ITU-R BS.1387-1 (1998-2001). These and further methods for quality assessment have in common that a signal to be tested (“test signal”), which is in general the output signal of a system, a network or, more generally, a device under test (DUT), is compared to an original or reference signal, which is in general the input signal of the DUT.
Such a general setting is illustrated in FIG. 6. The original audio signal fed into a DUT 600 represents the reference signal or input signal. The output signal of the DUT 600 is used either to perform a subjective auditory test with test persons, as indicated by a subject 602, or to perform a quality assessment method, for example PESQ or PEAQ, as illustrated by a model 604. Feeding the output signal of the DUT 600 to the subject 602 allows a subjective auditory test, which is typically performed with several test persons in standardized rooms. Feeding both the original audio signal before the DUT 600, i.e. the reference signal, and the audio signal distorted by the DUT 600 to the model 604 allows an objective test, i.e. an algorithmic evaluation without test persons.
The DUT 600 is typically a system whose influence on the auditory quality is to be evaluated. Such a system is, for example, a telecommunications connection and in particular a telephone connection, which may be wireless or wired. An alternative DUT 600 is, for example, an encoder/decoder path, in order to assess the quality impairment caused by an encoding concept followed by a decoding concept. When the model operates as intended, its output is a prediction of the perceived quality that test persons would subjectively indicate on a scale when they hear the output signal of the DUT 600.
In the PESQ method, for example, the original audio signal, i.e. the audio signal before the DUT 600, which is the reference signal, is compared to the audio signal distorted by the DUT 600, taking a time delay into account and using a psycho-acoustic model. In particular, both the original audio signal before the DUT 600 and the distorted audio signal after the DUT 600 are converted into a so-called internal representation which is analogous to the psycho-physical representation of audio signals in the human auditory system, wherein in particular quantities such as the Bark scale and the sone are considered, as is known in the art. The internal psycho-physical representation of the original audio signal is then compared to the internal psycho-physical representation of the distorted audio signal in order to calculate one or several error parameters, depending on the model, which allow a quantitative quality indication.
A quality assessment method as illustrated with reference to FIG. 6 is also referred to as an “intrusive” method, as it requires feeding the reference signal, i.e. the original audio signal, into the system to be tested (DUT 600). As indicated, the test signal to be evaluated is obtained at the output of the DUT; in FIG. 6 it is also referred to as the distorted audio signal or, in general, as the audio signal. The output of the DUT 600 may, for example, be the far end of a telephone connection between two subscribers, wherein the original audio signal is fed in at the near end as the reference signal. In this case, the measurement method, for example PESQ, would characterize the speech quality of a telephone connection.
As has been explained, the algorithmic measurement methods are based on a combination of psycho-acoustic and cognitive findings about human auditory perception. The experiment underlying those methods is essentially a subjective auditory test in which a statistically sufficient number of test listeners (subjects) is presented with a series of voice (speech) or audio sequences for assessment. The listeners assess those sequences using a discrete or continuous quality scale, also referred to as an “opinion scale”, ranging for example from 1 (“bad”) to 5 (“excellent”). Such subjective auditory tests are described, for example, in the standardization document ITU-T P.800 (08/1996).
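The ratings collected in such a test are commonly condensed into a mean opinion score. The following is a minimal sketch of that averaging, assuming a five-point opinion scale; the function name `mean_opinion_score`, the sample ratings and the normal-approximation confidence interval are illustrative assumptions and are not taken from ITU-T P.800.

```python
from math import sqrt
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1..5 opinion scale")
    mos = mean(ratings)
    # Assumption: normal approximation; a single rating gives no spread estimate.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width

# Hypothetical ratings of one test sequence by eight listeners.
ratings = [4, 3, 5, 4, 4, 3, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The width of the interval shrinks with the number of listeners, which is one way to read the requirement of a statistically sufficient number of subjects.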
It has been found that real test persons can reliably evaluate only short sequences. If the test persons are presented with a longer sequence, i.e. a longer test signal section, then a kind of “statistical averaging” takes place. In other words, the cognitive process of forgetting interferences already heard corrupts the statements of the test persons, and this corruption is inherent in the system because the test persons are human.
Consequently, standardized test processes, for example those in the standardization documents Rec. ITU-R BS.1116-1 or Rec. ITU-R BS.1534, mandate test sequences having a duration of typically between 8 and 12 seconds and a maximum length of 20 seconds. Although these test sequences are real signals, they are not taken stochastically or randomly from a real scenario; instead, they are standardized, predetermined test sequences that are fed into the DUT under observation in an experiment in order to obtain the test signal, i.e. the audio signal distorted by the DUT.
Recently, developments have been presented which also allow non-intrusive tests, which are intended to facilitate an estimation of the speech quality based merely on an analysis of the test signal on the receive side, i.e. without feeding in a reference signal on the transmit side. Such developments are of special advantage for practical realizations, as they allow, for example, an indication of the speech quality of a mobile radio connection in the terminal device alone, without requiring any measurement arrangements, preconditions and/or manipulations in the telephone network for feeding in a reference signal. It should be possible to subject every real telephone conversation to such a non-intrusive quality assessment concept.
This new non-intrusive concept is currently being developed. It is assumed that, for reasons of comparability with intrusive measurement concepts, test sequence lengths will also be mandatory for the non-intrusive measurement concept, and that these will be similar to the test sequence lengths of the intrusive tests, i.e. selected such that, on the one hand, no so-called “statistical averaging” or forgetting of an error occurs for the test listener due to a sequence which is too long and, on the other hand, the sequence is long enough for an assessment to be made. As already indicated, the duration of the test sequences is typically between 8 and 12 seconds, whereas test sequences, i.e. test signal sections, of at most 20 seconds are sometimes also admitted.
In particular with non-intrusive quality assessments of a distorted audio signal, or in assessing the influence of, for example, a transmission channel 600 in FIG. 6 on the audio signal, working with predefined test signal sections is no longer readily possible. Instead, real audio signals have to be used for quality assessment. Nevertheless, comparability of the measurement results is to be guaranteed, since a main advantage of standardized quality assessment methods is that the results of different tests are comparable.
The resulting problem is illustrated in the following with reference to FIG. 5. FIG. 5 shows a time diagram of a signal transmitted via a telephone connection, i.e. an audio signal which was distorted by the transmission via the telephone connection. In the time diagram of FIG. 5, a normalized amplitude is plotted along the ordinate and the time t along the abscissa. The signal illustrated in FIG. 5 clearly shows the characteristic of a voice signal in that information-carrying sections, for example the section between one second and nine seconds, are separated from each other by non-information-carrying sections, also referred to as pauses. The pause following the first information-carrying section extends from about 9 seconds to about 10.8 seconds. A longer information-carrying section then follows from 10.8 seconds to about 20.2 seconds. This second information-carrying section is followed by a pause from approximately 20.3 seconds to 21.3 seconds. After this second pause, a further information-carrying section follows, extending to approximately 23.7 seconds, whereupon again a pause follows.
The simplest possibility for extracting test signal sections would be to break down the audio signal illustrated in FIG. 5 into adjacent sections of equal length. One such fragmentation, yielding test signal sections having a duration of about 10 seconds, is illustrated by b(1), b(2), etc. Another fragmentation of the audio signal illustrated in FIG. 5, yielding test signal sections having a duration of, for example, 7.5 seconds, is illustrated by a(1), a(2), a(3), etc.
The fragmentation of the audio signal into sections of constant length is problematic in that it can no longer be predicted how large the information-carrying portion and the non-information-carrying portion of a test signal section are, i.e. what the ratio of information to pause is. In addition, longer pauses may occur between the conversation partners, in particular in telephone conversations. A test signal section could then, for example, consist only of a pause. Clearly, no quality assessment is possible based on a pause alone.
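The effect of a fixed-length fragmentation on the ratio of information to pause can be made concrete with a small sketch. It assumes the approximate section boundaries given in the description of FIG. 5, a fragmentation starting at 0 seconds and a total signal length of 24 seconds; the start offset, the total length and the helper names `fixed_sections` and `speech_fraction` are illustrative assumptions, so the printed values need not match the sections drawn in the figure.

```python
# Approximate information-carrying sections from the description of FIG. 5, in seconds.
SPEECH = [(1.0, 9.0), (10.8, 20.2), (21.3, 23.7)]

def fixed_sections(total, length, start=0.0):
    """Split [start, total) into adjacent sections of equal length.

    A trailing remainder shorter than `length` is dropped.
    """
    count = int((total - start) // length)
    return [(start + i * length, start + (i + 1) * length) for i in range(count)]

def speech_fraction(section, speech=SPEECH):
    """Fraction of a section that overlaps information-carrying intervals."""
    begin, end = section
    overlap = sum(max(0.0, min(end, e) - max(begin, s)) for s, e in speech)
    return overlap / (end - begin)

sections_a = fixed_sections(24.0, 7.5)   # sections of 7.5 s, as for a(1), a(2), ...
sections_b = fixed_sections(24.0, 10.0)  # sections of 10 s, as for b(1), b(2), ...

for name, sections in (("a", sections_a), ("b", sections_b)):
    for k, sec in enumerate(sections, start=1):
        print(f"{name}({k}) {sec[0]:5.1f}-{sec[1]:5.1f} s  speech {speech_fraction(sec):.0%}")

# A section that falls entirely into the first pause carries no information at all.
print(f"pause-only section: speech {speech_fraction((9.0, 10.8)):.0%}")
```

The speech fraction varies from section to section and depends on the chosen start offset, which is exactly the uncontrolled weighting described above.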
The procedure illustrated in FIG. 5 is only “good” when each telephone conversation is, for example, always shorter than 20 seconds, so that the complete telephone conversation may be taken as a test signal section. If this is not the case, breaking the signal down into constant time sections, as illustrated in FIG. 5, does not yield results comparable with a subjective auditory test result. In addition, measurement periods of different duration will lead at least to different, and possibly even useless, results. In particular for measurements in mobile telephone networks from a moving car using so-called “drive test tools”, a measurement duration as short as possible, or the fragmentation of real test conversations into shorter time intervals or measurement periods, is desired, as indicated by a(1), a(2), a(3) in FIG. 5. These shorter measurement durations are particularly desirable in mobile radio networks in order to correlate the measurement periods with geographical data and thus obtain a geographically detailed statement on the quality of a mobile radio system.
As already indicated above, FIG. 5 shows a graphical illustration of the time signal of a voice signal obtained from a real telephone conversation. The modulation parts carrying voice activity, i.e. the information-carrying sections of the signal, here spoken sentences, and the voice pauses in between, i.e. the non-information-carrying sections, may easily be seen. It is to be noted that the signal shown in FIG. 5 was recorded on the listener side at one end of the ongoing communication. As explained, a conversation contains substantially longer pauses in which the other party talks. These are omitted in FIG. 5 for clarity.
In FIG. 5, two possible fragmentations based on a division into fixed time sections are illustrated. It may clearly be seen that a time section may begin (a(2), b(2)) or end (a(1), a(2), . . . , b(1)) within the modulation, i.e. in the middle of a word or sentence.
In addition, it may also happen, and will particularly be the case in a dialog, that a test signal section consists mainly or completely of a pause, as may be seen in part from the test signal section a(2), one third of which is a pause.
The partitioning of an audio signal to be assessed into fixed time sections thus does not meet the requirements for sequences suitable for an auditory test, i.e. speech samples typically comprising two sentences and having a maximum duration of 20 seconds. It is further desired that such sequences ideally start with a pause, end with a pause and, in particular, are also separated by pauses when subsequent test signal sections are considered.
In addition, the “hard” switching on and off within modulation parts, for example the hard switching off of the information-carrying section in the test signal section a(1), leads to interference noise which may also be referred to as spectral interference noise or “crackle”. In signal theory, the hard truncation of a modulation part corresponds to multiplying the signal by a step function in the time domain, i.e. to convolving the spectrum of the signal with the spectrum of the step function. This interference noise, or these artefacts, would be evaluated as an interference by a measurement method, which would directly lead to, for example, a communication connection being assessed as worse than it actually is.
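The discontinuity created by a hard cut can be illustrated with a short sketch. It assumes a 250 Hz sine tone sampled at 8 kHz that is cut at a crest; for comparison only, it also applies a linear fade-out, which is an illustrative contrast and not a technique described above. All names and parameter values are hypothetical.

```python
import math

def tone(freq_hz, rate_hz, n_samples):
    """Sine tone with unit amplitude."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n_samples)]

def hard_cut(signal, cut):
    """Keep samples before `cut` and zero the rest (multiplication by a step function)."""
    return [x if i < cut else 0.0 for i, x in enumerate(signal)]

def faded_cut(signal, cut, fade):
    """Like hard_cut, but with a linear fade-out over the `fade` samples before `cut`."""
    out = []
    for i, x in enumerate(signal):
        if i >= cut:
            gain = 0.0
        elif i >= cut - fade:
            gain = (cut - i) / fade
        else:
            gain = 1.0
        out.append(gain * x)
    return out

def max_step(signal):
    """Largest jump between two consecutive samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

RATE_HZ = 8000
sig = tone(250, RATE_HZ, 2000)
CUT = 1001  # sample 1000 lies on a crest of the tone, so the cut is maximally abrupt
hard = hard_cut(sig, CUT)
soft = faded_cut(sig, CUT, fade=160)  # 20 ms fade at 8 kHz

print(f"largest step, uncut tone: {max_step(sig):.3f}")
print(f"largest step, hard cut:   {max_step(hard):.3f}")
print(f"largest step, faded cut:  {max_step(soft):.3f}")
```

The hard cut produces a sample-to-sample jump several times larger than anything in the tone itself; such a jump spreads energy across the spectrum and is heard as a click, which a measurement method would attribute to the connection.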