The invention relates to a method for making a machine-aided assessment of the transmission quality of audio signals, in particular of speech signals, spectra of a source signal to be transmitted and of a transmitted reception signal being determined in a frequency domain.
The assessment of the transmission quality of speech channels is gaining increasing importance with the growing proliferation and geographical coverage of mobile radio telephony. There is a desire for a method which is objective (i.e. not dependent on the judgment of a specific individual) and can run automatically.
Perfect transmission of speech via a telecommunications channel in the standardized 0.3-3.4 kHz frequency band gives about 98% sentence comprehension. However, the introduction of digital mobile radio networks with speech coders in the terminals can greatly impair the comprehensibility of speech. Moreover, determining the extent of the impairment presents certain difficulties.
Speech quality is a vague term compared, for example, with bit rate, echo or volume. Since customer satisfaction can be measured directly according to how well the speech is transmitted, coding methods need to be selected and optimized in relation to their speech quality. In order to assess a speech coding method, it is customary to carry out very elaborate auditory tests. The results are in this case far from reproducible and depend on the motivation of the test listeners. It is therefore desirable to have a hardware replacement which, by suitable physical measurements, measures the speech performance features which correlate as well as possible with subjectively obtained results (Mean Opinion Score, MOS).
EP 0 644 674 A2 discloses a method for assessing the transmission quality of a speech transmission path which makes it possible to obtain, automatically, an assessment which correlates strongly with human perception. This means that the system can evaluate the transmission quality on the same scale as would be used by a trained test listener. The key idea consists in using a neural network, which is trained using a speech sample. The net effect is that an integral quality assessment takes place; the reasons for the loss of quality are not addressed.
Modern speech coding methods perform data compression and use very low bit rates. For this reason, simple known objective methods, such as for example the signal-to-noise ratio (SNR), fail.
The object of the invention is to provide a method of the type mentioned at the start, which makes it possible to obtain an objective assessment (speech quality prediction) while taking the human auditory process into account.
The way in which the object is achieved is defined by the features of claim 1. According to the invention, in order to assess the transmission quality a spectral similarity value is determined which is based on calculation of the covariance of the spectra of the source signal and reception signal and division of the covariance by the standard deviations of the two said spectra.
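The similarity value defined above (covariance of the two spectra divided by the product of their standard deviations) can be illustrated with a minimal Python sketch. The function name, the representation of a spectrum as a list of non-negative band values, and the handling of a completely flat spectrum are assumptions made for illustration only and are not taken from the description.

```python
import math


def spectral_similarity(source, received):
    """Covariance of two spectra divided by the product of their standard deviations."""
    if len(source) != len(received) or len(source) < 2:
        raise ValueError("spectra must have the same length (at least 2 values)")
    n = len(source)
    mean_s = sum(source) / n
    mean_r = sum(received) / n
    # Covariance of the source and reception spectra.
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(source, received)) / n
    # Standard deviations of the two spectra.
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in source) / n)
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in received) / n)
    if std_s == 0.0 or std_r == 0.0:
        # Assumption: a completely flat spectrum is treated as "no similarity".
        return 0.0
    return cov / (std_s * std_r)
```

In this sketch a reception spectrum that is merely a scaled copy of the source spectrum yields a value of 1, which is why the energy-dependent weighting described below is applied in addition.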
Tests with a range of graded speech samples and the associated auditory judgment (MOS) have shown that a very good correlation with the auditory values can be obtained on the basis of the method according to the invention. Compared with the known procedure based on a neural network, the present method has the following advantages:
Less demand on storage and CPU resources. This is important for real-time implementation.
No elaborate system training for using new speech samples.
No suboptimal reference inherent in the system. The best speech quality which can be measured using this measure corresponds to that of the speech sample.
Preferably, the spectral similarity value is weighted with a factor which, as a function of the ratio between the energies of the spectra of the reception and source signals, reduces the similarity value to a greater extent when the energy of the reception signal is greater than the energy of the source signal than when the energy of the reception signal is lower than that of the source signal. In this way, extra signal content in the reception signal is more negatively weighted than missing signal content.
According to a particularly preferred embodiment, the weighting factor is also dependent on the signal energy of the reception signal. For any ratio of the energies of the spectra of reception to source signal, the similarity value is reduced commensurately to a greater extent the higher the signal energy of the reception signal is. As a result, the effect of interference in the reception signal on the similarity value is controlled as a function of the energy of the reception signal. To that end, at least two level windows are defined, one below a predetermined threshold and one above this threshold. Preferably, a plurality of, in particular three, level windows are defined above the threshold. The similarity value is reduced according to the level window in which the reception signal lies. The higher the level, the greater the reduction.
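One possible shape of such a weighting factor is sketched below in Python. The description fixes only the qualitative behaviour (a stronger reduction for extra signal content than for missing content, and a stronger reduction at higher reception levels, selected by level window). The linear penalty in decibels, the threshold, the window edges and all numeric constants are hypothetical values chosen purely for illustration.

```python
import math

# All numbers below are illustrative assumptions, not values from the description.
THRESHOLD = 1e-3                    # energy below which the lowest level window applies
WINDOW_EDGES = (1e-2, 1e-1)         # separate the three level windows above THRESHOLD
LEVEL_GAIN = (0.5, 1.0, 1.5, 2.0)   # penalty gain per window, rising with the level
EXTRA_SLOPE = 0.04                  # penalty per dB when reception energy > source energy
MISSING_SLOPE = 0.02                # penalty per dB when reception energy < source energy


def level_window(e_received):
    """Index of the level window: 0 below the threshold, 1 to 3 above it."""
    if e_received <= THRESHOLD:
        return 0
    return 1 + sum(e_received > edge for edge in WINDOW_EDGES)


def weighting_factor(e_source, e_received):
    """Factor between 0 and 1 by which the spectral similarity value is multiplied."""
    if e_source <= 0.0 or e_received <= 0.0:
        raise ValueError("energies must be positive")
    ratio_db = 10.0 * math.log10(e_received / e_source)
    # Extra content in the reception signal is penalised more than missing content.
    slope = EXTRA_SLOPE if ratio_db > 0.0 else MISSING_SLOPE
    penalty = slope * LEVEL_GAIN[level_window(e_received)] * abs(ratio_db)
    return max(0.0, 1.0 - penalty)
```

With these assumed constants, a reception signal with twice the source energy is reduced more than one with half the source energy, and the same energy ratio is penalised more heavily in a higher level window.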
The invention can in principle be used for any audio signals. If the audio signals contain inactive phases (as is typically the case with speech signals) it is recommendable to perform the quality evaluation separately for active and inactive phases. Signal segments whose energy exceeds the predetermined threshold are assigned to the active phase, and the other segments are classified as pauses (inactive phases). The spectral similarity described above is then calculated only for the active phases.
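The assignment of signal segments to active and inactive phases can be sketched as follows in Python. Taking the mean square of the sampled values as the segment energy, as well as the function names, are assumptions; the description states only that segments whose energy exceeds the predetermined threshold belong to the active phase.

```python
def frame_energy(frame):
    """Mean square of the sampled values (assumed definition of the segment energy)."""
    return sum(x * x for x in frame) / len(frame)


def split_phases(frames, threshold):
    """Return the indices of the active segments and of the pauses."""
    active, pauses = [], []
    for index, frame in enumerate(frames):
        if frame_energy(frame) > threshold:
            active.append(index)   # energy exceeds the threshold: active phase
        else:
            pauses.append(index)   # all other segments: pause (inactive phase)
    return active, pauses
```

The spectral similarity would then be evaluated only for the segments listed in the first return value.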
For the inactive phases (e.g. speech pauses) a quality function can be used which falls off degressively as a function of the pause energy Epa:

$$A \cdot \frac{\log_{10}(E_{pa})}{\log_{10}(E_{max})}$$
A is a suitably selected constant, and Emax is the greatest possible value of the pause energy.
The overall quality of the transmission (that is to say the actual transmission quality) is given by a weighted linear combination of the qualities of the active and of the inactive phases. The weighting factors depend in this case on the proportion of the total signal which the active phase represents, and specifically in a non-linear way which favours the active phase. With an active-phase proportion of e.g. 50%, the weight given to the quality of the active phase may be of the order of e.g. 90%.
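The weighted linear combination can be illustrated with the following Python sketch. It reads the numerical example above as meaning that an active-phase proportion of 50% leads to a weight of about 90% for the active-phase quality. The power-law form of the weight and the function names are hypothetical; the description requires only a non-linear dependence which favours the active phase.

```python
import math


def active_weight(active_share, weight_at_half=0.9):
    """Non-linear weight of the active phase (hypothetical power-law shape)."""
    if not 0.0 <= active_share <= 1.0:
        raise ValueError("active_share must lie between 0 and 1")
    # Exponent chosen so that an active share of 50% gives weight_at_half.
    exponent = math.log(weight_at_half) / math.log(0.5)
    return active_share ** exponent


def overall_quality(q_active, q_pause, active_share):
    """Weighted linear combination of active-phase and pause qualities."""
    w = active_weight(active_share)
    return w * q_active + (1.0 - w) * q_pause
```

Under these assumptions a signal consisting solely of pauses is rated by the pause quality alone, and a signal without pauses by the active-phase quality alone.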
Pauses or interference in the pauses are thus taken into account separately and to a lesser extent than active signal phases. This accounts for the fact that essentially no information is transmitted in pauses, but that it is nevertheless perceived as unpleasant if interference occurs in the pauses.
According to an especially preferred embodiment, the time-domain sampled values of the source and reception signals are combined in data frames which overlap one another by from a few milliseconds to a few dozen milliseconds (e.g. 16 ms). This overlap reproduces, at least partially, the time masking inherent in the human auditory system.
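The formation of overlapping data frames can be sketched in Python as follows. The function name and the decision to discard an incomplete final frame are assumptions. As a purely illustrative example, at an assumed sampling rate of 8 kHz an overlap of 16 ms corresponds to 128 sampled values.

```python
def overlapping_frames(samples, frame_len, overlap):
    """Combine sampled values into frames that overlap by `overlap` samples."""
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must be non-negative and smaller than frame_len")
    hop = frame_len - overlap  # distance between the starts of successive frames
    # Assumption: an incomplete frame at the end of the signal is discarded.
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

The same framing would be applied to the source signal and to the reception signal so that corresponding frames can be compared.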
A substantially realistic reproduction of the time masking is obtained if, in addition, after the transformation to the frequency domain, the spectrum of the current frame has the attenuated spectrum of the preceding one added to it. The spectral components are in this case preferably weighted differently. Low frequency components in the preceding frame are weighted more strongly than ones with higher frequency.
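The addition of the attenuated preceding spectrum can be sketched in Python as follows. The linearly falling attenuation weights, their end values of 0.5 and 0.1, and the use of the unmasked preceding spectrum are assumptions made for illustration; the description specifies only that low frequency components of the preceding frame are weighted more strongly.

```python
def masking_weights(n_bins, low_freq_weight=0.5, high_freq_weight=0.1):
    """Attenuation weights that fall linearly from low to high frequency (assumed shape)."""
    if n_bins == 1:
        return [low_freq_weight]
    step = (high_freq_weight - low_freq_weight) / (n_bins - 1)
    return [low_freq_weight + i * step for i in range(n_bins)]


def apply_time_masking(spectra, weights):
    """Add the attenuated spectrum of the preceding frame to each frame spectrum."""
    masked = []
    previous = None
    for spectrum in spectra:
        if previous is None:
            masked.append(list(spectrum))  # the first frame has no predecessor
        else:
            masked.append([cur + w * prev
                           for cur, w, prev in zip(spectrum, weights, previous)])
        previous = spectrum  # assumption: the unmasked spectrum is carried forward
    return masked
```

A frame that follows a loud frame thus retains part of the preceding spectral content, most strongly in the low frequency bands.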
It is recommendable to carry out compression of the spectral components before performing the time masking, by exponentiating them with a value α less than 1 (e.g. α=0.3). This is because if a plurality of frequencies occur at the same time in a frequency band, an over-reaction takes place in the auditory system, i.e. the total volume is perceived as greater than that of the sum of the individual frequencies. As an end effect, it means compressing the components.
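The compression by exponentiation can be illustrated with a short Python sketch. The function name and the restriction to non-negative spectral components are assumptions; the exponent 0.3 is the example value given above.

```python
def compress(spectrum, alpha=0.3):
    """Compress non-negative spectral components by raising them to the power alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if any(component < 0.0 for component in spectrum):
        raise ValueError("spectral components must be non-negative")
    return [component ** alpha for component in spectrum]
```

In this sketch two components in a ratio of 1000:1 are brought to a ratio of roughly 8:1, so that strong components dominate the subsequent steps to a lesser extent.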
A further measure for obtaining a good correlation between the assessment results of the method according to the invention and subjective human perception consists in convolving the spectrum of a frame with an asymmetric "smearing function". This mathematical operation is applied both to the source signal and to the reception signal, before the similarity is determined.
The smearing function is, in a frequency/loudness diagram, preferably a triangle function whose left edge is steeper than its right edge.
Before the convolution, the spectra may additionally be expanded by exponentiation with a value ε greater than 1 (e.g. ε=4/3). The loudness function characteristic of the human ear is thereby simulated.
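The expansion and the subsequent convolution with an asymmetric triangle function can be sketched in Python as follows. The kernel widths (one band on the left, three on the right), the peak value of 1 and the absence of any normalisation are assumptions; the description requires only that the left edge of the triangle be steeper than its right edge.

```python
def expand(spectrum, epsilon=4.0 / 3.0):
    """Expand non-negative spectral components by raising them to the power epsilon."""
    return [component ** epsilon for component in spectrum]


def triangle_kernel(left_width, right_width):
    """Triangle weights for band offsets -left_width..+right_width, peak 1.0 at offset 0."""
    left = [(i + 1) / (left_width + 1) for i in range(left_width)]
    right = [(right_width - i) / (right_width + 1) for i in range(right_width)]
    return left + [1.0] + right


def smear(spectrum, left_width=1, right_width=3):
    """Convolve a spectrum with the asymmetric triangle (steeper left edge)."""
    kernel = triangle_kernel(left_width, right_width)
    out = [0.0] * len(spectrum)
    for band, value in enumerate(spectrum):
        for offset in range(-left_width, right_width + 1):
            target = band + offset
            if 0 <= target < len(spectrum):  # contributions beyond the band limits are dropped
                out[target] += value * kernel[offset + left_width]
    return out
```

With these assumed widths a single spectral line is spread over one band towards lower frequencies and over three bands towards higher frequencies. The same operation would be applied to the source spectrum and to the reception spectrum before the similarity is determined.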
The detailed description below and the set of patent claims will give further advantageous embodiments and combinations of features of the invention.