The digital transmission of speech occurs in many applications, including numerous telephone applications. In telephone applications such as mobile communication systems, low power consumption is crucial to longer battery life and, consequently, to better performance. In cellular telephones, for example, power can be conserved by switching off the transmitter between bursts of speech. In an end-to-end telephone conversation, each user typically speaks about 40-60% of the time. Between these bursts of speech, the transmitter is simply being used to send background noise to the receiver.
By efficiently detecting voice activity, switching off the transmitter when no voice is present, and using a perceptually acceptable method of filling in the gaps between the speech bursts, the lifetime of the battery can be approximately doubled at little additional cost. This technique, known as discontinuous transmission, also eases packet traffic in typical Code-Division Multiple Access (CDMA) and Time-Division Multiple Access (TDMA) communication systems, allowing more subscribers to use the network with less interference. FIG. 1 shows an exemplary vocoder 10 used in such communication systems. The vocoder 10 includes an encoder 12, which processes data for transmission over output channel 16, and a decoder 14, which processes incoming communications from input channel 18.
The encoder 12 is shown in more detail in FIG. 2. The exemplary encoder 12 shown in FIG. 2 includes a control module 20, a voice activity detector (VAD) 22, a speech parameter generator 24 and a noise parameter generator 26. The decoder 14 is shown in more detail in FIG. 3 and includes a control module 30, a speech parameter detector 32, a speech generator 34 and a comfort noise generator 36.
An important component in the encoder 12 of a discontinuous transmission system is the VAD 22, which detects pauses in speech so that no data is transmitted during periods of no voice activity. The VAD 22 must detect the absence of speech in a signal as often as possible while not misclassifying speech as noise, even in poor Signal-to-Noise Ratio (SNR) conditions. A primary problem with systems which use the VAD 22, however, is clipping of the initial parts of the detected speech. This occurs in part because speech transmission is not resumed until after speech activity has been detected. Another problem is the lack of the background noise during inactivity that would normally be heard in a continuous transmission system.
In an attempt to improve the quality of synthesized speech generated by the speech generator 34 in systems which use the VAD 22 to reduce data transmissions, synthesized comfort noise, generated by the comfort noise generator 36, is added during the decoding process performed by the decoder 14 to fill in the gaps between the bursts of speech. The synthesized comfort noise, however, does not model the actual background noise experienced at the encoder 12; thus, any quality improvements are minimal.
Some techniques to capture the actual nature of the background noise and to convey it to the speech decoder 14 have been proposed in the prior art.
In typical speech compression schemes like Code-Excited Linear Prediction (CELP) [see M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High quality speech at very low bit rates", Proc. Inter. Conf. Acoust., Speech, Signal Processing, 1985, pp. 937-940, vol. 1.], the digitally sampled input speech received through input channel 16 is divided into non-overlapping frames for the purpose of analysis. The VAD 22 then classifies each frame as being either speech or noise.
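The framing and per-frame classification described above can be illustrated with a minimal sketch. It is not the VAD of any standard: the energy threshold rule, the 160-sample frame length, and the names `split_frames`, `frame_energy` and `classify_frames` are assumptions chosen only for illustration.

```python
def split_frames(samples, frame_len):
    """Split samples into non-overlapping frames, dropping a trailing partial frame."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def classify_frames(samples, frame_len=160, threshold=1e-3):
    """Label each frame 'speech' or 'noise' with a simple energy threshold.

    The threshold rule is a stand-in for a real VAD decision (assumption).
    """
    return [
        "speech" if frame_energy(frame) > threshold else "noise"
        for frame in split_frames(samples, frame_len)
    ]


# One silent frame, one loud frame, then a partial frame that is dropped.
labels = classify_frames([0.0] * 160 + [0.5] * 160 + [0.0] * 100)
```

A practical VAD would typically use more robust features than raw energy, which is one reason misclassification occurs in poor SNR conditions.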
To synthetically generate a noise similar to the background noise, a common approach in such systems is to capture the statistics of this noise and to generate a statistically similar pseudo-random noise at the decoder 14. A common model for background noise is a low-order auto-regressive process. An advantage of this model is its similarity to the model often used for regular speech. This similarity allows the use of similar quantization schemes to compress the short-term parameters of both noise and speech in the noise parameter generator 26 and in the speech parameter generator 24, respectively. The auto-regressive model can then be deduced from the short-term auto-correlation values of the noise process.
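One well-known way to deduce an auto-regressive model from short-term auto-correlation values is the Levinson-Durbin recursion. The sketch below assumes that method; the text does not state which procedure a given system uses, and the function names and the sign convention (prediction-error filter A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p) are likewise assumptions.

```python
def autocorrelation(frame, order):
    """Short-term auto-correlation values r[0..order] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i - k] for i in range(k, n))
        for k in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve for AR coefficients from auto-correlation values.

    Returns (a, err) where a[0] == 1.0 and the model is
    x[n] = -(a[1]*x[n-1] + ... + a[order]*x[n-order]) + e[n],
    with err the variance of the prediction residual e[n].
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (e.g. an all-zero frame)
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


# Auto-correlation of an ideal first-order process with coefficient 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a first-order model: `coeffs[1]` is -0.5, `coeffs[2]` is zero, and the residual variance is 0.75.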
In many discontinuous transmission schemes, the first few frames classified as noise are re-classified as "noise-analysis frames." During these frames, the noise is coded as regular speech; however, the auto-correlation values computed during the analysis of these frames are averaged to compute the auto-correlation of the noise. If more noise frames follow the noise-analysis frames, these averaged auto-correlation values are used to inform the decoder 14 before the transmitter is switched off.
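The uniform averaging step described above can be sketched as follows. The function name `average_autocorrelation` and the data layout (one list of auto-correlation values per noise-analysis frame) are assumptions for illustration; the cited standards define their own frame counts and fixed-point formats.

```python
def average_autocorrelation(per_frame_values):
    """Uniformly average auto-correlation values across noise-analysis frames.

    per_frame_values: list of equal-length lists, one per frame, where
    entry k of each inner list is that frame's auto-correlation at lag k.
    """
    count = len(per_frame_values)
    lags = len(per_frame_values[0])
    return [
        sum(values[k] for values in per_frame_values) / count
        for k in range(lags)
    ]


# Two hypothetical noise-analysis frames, lags 0 and 1.
averaged = average_autocorrelation([[1.0, 0.5], [3.0, 1.5]])
```

Because every frame carries equal weight, a single frame with much higher energy than the others dominates the result.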
This approach has been used by the Groupe Spécial Mobile (GSM) of the European Telecommunications Standards Institute (ETSI) in both the full-rate [see European Telecommunications Standards Institute (ETSI), European Digital Cellular Telecommunication System (Phase 2); Voice Activity Detection (VAD) (GSM 06.32)] and the half-rate [see European Telecommunications Standards Institute (ETSI), European Digital Cellular Telecommunication System; Half-rate Speech Part 6: Voice Activity Detection (VAD) for half rate speech traffic channels (GSM 06.42)] standards.
The VAD 22, which distinguishes noise from speech, is usually inaccurate, however, and it is reasonable to expect the first few noise-analysis frames to contain a few milliseconds of speech. Thus, when these frames are uniformly averaged, the resulting auto-correlation parameters do not accurately represent the statistics of the actual background noise. The result is often annoying noise between bursts of speech.
Further, in typical discontinuous transmission schemes, the decoder 14 fills in the gaps between speech bursts by simply creating an auto-regressive noise whose statistics match those of the background noise. This approach is used in both the GSM full-rate [see European Telecommunications Standards Institute (ETSI), European Digital Cellular Telecommunication System; (Phase 2) Part 4: Comfort Noise aspects for the full rate speech traffic channel (GSM 06.12)] and half-rate [see European Telecommunications Standards Institute (ETSI), European Digital Cellular Telecommunication System; Comfort Noise aspects for the half-rate speech traffic channels (GSM 06.22)] standards. This results in noise bursts which do not blend smoothly with the background noise present when the speakers are active.
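The decoder-side gap filling described above amounts to driving an all-pole synthesis filter with a pseudo-random excitation. The following is a minimal sketch under stated assumptions: a Gaussian excitation, a single scalar `gain`, and the coefficient convention x[n] = e[n] - (a[1]*x[n-1] + ... + a[p]*x[n-p]). The name `synthesize_comfort_noise` is hypothetical, and the sketch is not the procedure of GSM 06.12 or GSM 06.22.

```python
import random


def synthesize_comfort_noise(a, gain, num_samples, seed=0):
    """Generate auto-regressive pseudo-random noise.

    a: AR coefficients with a[0] == 1.0 (prediction-error filter form).
    gain: standard deviation of the white Gaussian excitation.
    seed: fixed seed so the sketch is reproducible (assumption).
    """
    rng = random.Random(seed)
    order = len(a) - 1
    history = [0.0] * order  # most recent output sample first
    output = []
    for _ in range(num_samples):
        excitation = gain * rng.gauss(0.0, 1.0)
        sample = excitation - sum(
            a[k] * history[k - 1] for k in range(1, order + 1)
        )
        output.append(sample)
        if order:
            history = [sample] + history[:-1]
    return output


# One 160-sample frame of noise from a hypothetical first-order model.
noise = synthesize_comfort_noise([1.0, -0.5], gain=0.1, num_samples=160, seed=1)
```

Noise generated this way matches the modelled spectrum only on average, which is consistent with the observation that it does not blend smoothly with the real background noise.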