The present invention relates to a method for assessing background noise during speech pauses of recorded or transmitted speech signals.
The perceived speech quality, for example, in telephone connections or radio transmissions, is chiefly determined by speech-simultaneous interference, that is, by interference during speech activity. However, noise during the speech pauses also contributes to the quality judgement, in particular in the case of high-quality speech reproduction.
The intensity of the background noise during the speech pauses can be used as a supplementary characteristic for determining the speech quality.
Speech quality evaluations of speech signals are generally carried out by listening (“subjective”) tests with test subjects.
On the other hand, the goal of instrumental (“objective”) methods for determining speech quality is to determine characteristics which describe the speech quality of the speech signal from properties of the speech signal to be assessed, using suitable calculation methods without having to draw on the judgements of test subjects.
A reliable quality assessment is provided by instrumental methods which are based on a comparison of the undisturbed reference speech signal (source speech signal) and the disturbed speech signal at the end of the transmission chain. There are many such methods, which are mostly employed in so-called “test connection systems”. In this context, the undisturbed source speech signal is injected at the source and recorded after transmission.
Known methods for determining the intensity of background noise usually start from the disturbed signal itself and use a determined intensity threshold to distinguish active speech and speech pauses (FIG. 1). In the simplest case, this threshold is set to be constant in the method, but can also be adapted on the basis of the signal pattern (for example, a defined distance from the signal peak value). The goal is a reliable distinction between speech and speech pause. If the distinction is achieved, the sought intensity characteristics of the background noise can be determined from the signal segments that have been identified as a speech pause. To this end, the signal segments that have been identified as a speech pause are generally further divided into shorter segments (typically 8 to 40 ms) and the intensity calculations (for example, effective value or loudness) are carried out for these shorter segments. Then, intensity characteristics can be determined from the results.
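The threshold procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular known method: the function names, the 20 ms segment length, and the 30 dB distance of the threshold below the loudest segment are assumptions chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Return the effective value (RMS) of each full frame of frame_len samples."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def pause_noise_level_db(samples, sample_rate=8000, frame_ms=20, margin_db=30.0):
    """Estimate the background noise level (dB re full scale 1.0) in speech pauses.

    Assumption: a frame counts as a speech pause when its RMS lies more than
    margin_db below the loudest frame (an adaptive threshold derived from the
    signal peak). Returns None if no pause frame is found.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    levels = frame_rms(samples, frame_len)
    peak = max(levels) if levels else 0.0
    if peak == 0.0:
        return None  # empty or silent input: no threshold can be derived
    threshold = peak * 10 ** (-margin_db / 20)
    pause_levels = [r for r in levels if r < threshold]
    if not pause_levels:
        return None  # threshold too low: every frame was judged to be speech
    mean_power = sum(r * r for r in pause_levels) / len(pause_levels)
    if mean_power == 0.0:
        return float("-inf")
    return 10 * math.log10(mean_power)
```

With a high speech-to-noise ratio the pause frames fall well below the threshold and the estimate is stable. As the noise level approaches the speech level, the function either returns None or starts counting quiet speech frames as pauses.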
Given low noise intensities during speech pauses and, at the same time, high speech intensity (high speech-to-noise ratio), these methods yield reliable measured values because a reliable distinction can be made between speech and speech pause (FIG. 1).
In the case of increasing noise intensities during speech pauses (decreasing speech-to-noise ratio), increasing uncertainty arises in the distinction between speech and speech pauses. Here, it is difficult to fix the threshold value in such a manner that, on one hand, no noise segments of higher intensity are detected as speech (threshold too low) and, on the other hand, no speech segments of lower intensity are judged as a speech pause (threshold too high) (FIG. 2).
If the intensity of the noise during the speech pauses reaches or even exceeds the intensity of the active speech, no intensity threshold can be found that would permit a distinction between speech and speech pause.
Solutions to the described problems are possible if, for example, speech and background noise have different spectral characteristics. By appropriately prefiltering the signal or via spectral analysis and evaluation of selected frequency bands, it is possible here to achieve a higher speech-to-background noise ratio in the observed frequency bands, making a reliable distinction between speech and speech pause possible again.
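The effect of prefiltering can be illustrated with a deliberately simple case. The sketch below assumes that the background noise is a 50 Hz hum and uses a 1000 Hz tone as a stand-in for speech. Both signals, the first-difference high-pass filter, and all function names are assumptions made for the example, not part of any cited method.

```python
import math


def first_difference(samples):
    """Crude high-pass prefilter: y[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - prev)
        prev = x
    return out


def rms(samples):
    """Effective value of a non-empty sample sequence."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(speech, noise):
    """Speech-to-noise ratio in dB from separately available components."""
    return 20 * math.log10(rms(speech) / rms(noise))


def tone(freq_hz, amplitude, n, sample_rate=8000):
    """Sine tone used here as a hypothetical stand-in signal."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


speech_proxy = tone(1000, 0.1, 8000)  # stand-in for speech energy
hum = tone(50, 0.1, 8000)             # low-frequency background noise

# The filter is linear, so filtering the components separately is equivalent
# to filtering their sum and lets the ratio be read off directly.
snr_before = snr_db(speech_proxy, hum)
snr_after = snr_db(first_difference(speech_proxy), first_difference(hum))
print(f"SNR before: {snr_before:.1f} dB, after prefiltering: {snr_after:.1f} dB")
```

Because the filter attenuates the low-frequency hum far more than the 1000 Hz component, the ratio in the filtered signal is higher and an intensity threshold becomes usable again. The approach brings no benefit when speech and noise occupy the same frequency range.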
Other solutions make use of certain parameters, which are determined in speech coding, and use them to distinguish between speech and segments containing background noise. In this context, the goal is to derive from the parameters whether the observed signal segment has typical properties of speech (for example, voiced portions). An example of this is the “Voice-Activity Detector” (ETSI Recommendation GSM 06.92, Valbonne, 1989).
In the case of low speech-to-noise ratios, these methods are more robust and are primarily used to suppress the transmission of speech pauses, for example, in mobile radio communications. However, the methods show uncertainties when the background noise itself contains speech or is similar to speech. Such segments are then classified as speech although they are perceived by a listener as disturbing background noise.
Instrumental speech quality measurement methods are usually based on the principle of signal comparison of the undisturbed reference speech signal and the disturbed signal to be assessed. Examples of this include the publications:
“A perceptual speech-quality measure based on a psychoacoustic sound representation” (Beerends, J. G.; Stemerdink, J. A.: J. Audio Eng. Soc. 42 (1994) 3, p. 115-123).
“Auditory distortion measure for speech coding” (Wang, S.; Sekey, A.; Gersho, A.: IEEE Proc. Int. Conf. Acoust., Speech and Signal Processing (1991), p. 493-496).
Such a method is also described in the ITU-T standard P.861 currently in force: “Objective quality measurement of telephone-band speech codecs” (ITU-T Rec. P.861, Geneva 1996).
Such measurement methods are employed in so-called “test connection systems”, in which a known reference speech signal (source speech signal) is injected at the source, transmitted, for example, via a telephone connection, and recorded at the sink. Subsequent to recording the speech signal, its properties are compared to those of the undisturbed source speech signal to assess the speech quality of the possibly disturbed speech signal.
If the undisturbed source speech signal is available to determine the background noise during speech pauses, then this signal can be used to determine the transition moments from speech to speech pause or from speech pause to speech, respectively. To this end, for example, a method with threshold value determination, as described above, is applied to the source speech signal. The method provides reliable distinctions between speech and speech pause because the speech-to-noise ratio in the undisturbed source speech signal is sufficiently high (FIG. 3a). The moments of threshold passage, that is, beginning and end of speech activity can now be transferred to the disturbed speech signal (FIG. 3b).
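The transfer of speech activity boundaries from the source speech signal to the disturbed signal can be sketched as follows. The sketch assumes that both signals are time-aligned sample sequences of equal length. The function names, the frame length, and the fixed threshold are hypothetical choices for illustration.

```python
import math


def active_segments(reference, frame_len, threshold):
    """Find speech activity in the undisturbed reference signal.

    Returns a list of (start, end) sample ranges, end exclusive, covering
    consecutive frames whose RMS is at or above the threshold.
    """
    segments = []
    start = None
    n_frames = len(reference) // frame_len
    for k in range(n_frames):
        frame = reference[k * frame_len:(k + 1) * frame_len]
        level = math.sqrt(sum(x * x for x in frame) / frame_len)
        if level >= threshold and start is None:
            start = k * frame_len              # speech pause -> speech
        elif level < threshold and start is not None:
            segments.append((start, k * frame_len))  # speech -> speech pause
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


def pause_rms(degraded, segments):
    """RMS of the disturbed signal over all samples outside the active segments.

    Returns None if the segments leave no pause samples.
    """
    active = [False] * len(degraded)
    for s, e in segments:
        for i in range(max(s, 0), min(e, len(degraded))):
            active[i] = True
    pause = [x for x, a in zip(degraded, active) if not a]
    if not pause:
        return None
    return math.sqrt(sum(x * x for x in pause) / len(pause))
```

The threshold is applied only to the reference signal, where the speech-to-noise ratio is high. The noise level of the disturbed signal therefore has no influence on where the boundaries are placed.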
Such a method can be modified without problems if a constant time lag (for example, a delay due to signal transmission) occurs between the source speech signal and the disturbed signal. However, the condition is that this time lag can be reliably determined in advance and that it is then used to correct the end or beginning points of speech activity. This is mostly possible in the case of time-invariant systems because these have a constant delay (FIG. 3c).
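For a constant time lag, the correction can be illustrated with a brute-force cross-correlation search followed by a shift of the boundary points. This is a sketch under stated assumptions: the disturbed signal is assumed to lag the source signal by a non-negative number of samples no larger than max_lag, and the function names are hypothetical.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples (0..max_lag) that maximizes the cross-correlation
    between the reference and the disturbed signal."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(degraded) - lag)
        score = sum(reference[i] * degraded[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def shift_segments(segments, delay, length):
    """Shift (start, end) speech activity ranges by delay samples and clip them
    to the valid range [0, length] of the disturbed signal."""
    shifted = []
    for start, end in segments:
        s, e = max(start + delay, 0), min(end + delay, length)
        if s < e:
            shifted.append((s, e))
    return shifted
```

The exhaustive search is quadratic in effort and is shown only for clarity. It also presupposes a single delay for the entire signal, which is the condition named above.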
In principle, such a method also works if the time offset between the two signals is not constant over the entire signal length but variable. Such time-variant systems include, in particular, packet-based transmission systems, where marked fluctuations in the system delay can occur due to different packet transit times and a corresponding starting-point management in the receiver. To prevent losses due to packets that arrive late, some speech pauses are extended in the receiver and later ones are shortened. Starting or end points of speech activity can then only be transferred if the current delay at these points is known. The adaptive determination of the time offset is computing-time intensive and frequently only inadequately achieved, especially in the case of reduced speech-to-noise ratios. If the adaptive determination of the time offset is not achieved reliably, then the beginning and the end of speech pauses can be determined only inexactly or not at all. Because of this, the intensity characteristics of noise during pauses can be determined only unreliably or not at all.
As described, it is difficult or sometimes impossible to determine background noise during speech pauses even if the undisturbed source speech signal is known, especially when: a low speech-to-background noise ratio exists; the background noise contains speech or is itself similar to speech; or the time offset between the undisturbed source speech signal and the disturbed speech signal is not constant over the entire signal length.
The known methods are based on determining the starting and end points of a speech pause as accurately as possible. As a result, the signal of the pause segments is then available for further evaluation. The intensity characteristics are determined from these separated pause segments.