Discontinuous transmission (DTX) is used in mobile communication systems to switch the radio transmitter off during speech pauses. The use of DTX saves power in the mobile station and extends the time between battery recharges. It also reduces the general interference level and thus improves transmission quality.
However, during speech pauses the background noise which is transmitted with the speech also disappears if the channel is cut off completely. The result is an unnatural sounding audio signal (silence) at the receiving end of the communication.
It is known in the art, instead of completely switching the transmission off during speech pauses, to generate parameters that characterize the background noise, and to send these parameters over the air interface at a low rate in Silence Descriptor (SID) frames. These parameters are used at the receive side to regenerate background noise which reflects, as well as possible, the spectral and temporal content of the background noise at the transmit side. These parameters that characterize the background noise are referred to as comfort noise (CN) parameters. The comfort noise parameters typically include a subset of speech coding parameters: in particular synthesis filter coefficients and gain parameters.
It should be noted, however, that in some comfort noise evaluation schemes of some speech codecs, part of the comfort noise parameters are derived from speech coding parameters while other comfort noise parameter(s) are derived from, for example, signals that are available in the speech coder but that are not transmitted over the air interface.
It is assumed in prior art DTX systems that the excitation can be approximated sufficiently well by spectrally flat noise (i.e., white noise). In these systems, the comfort noise is generated by feeding locally generated, spectrally flat noise through a speech coder synthesis filter. However, such white noise sequences are unable to produce high quality comfort noise. This is because the optimal excitation sequences are not spectrally flat, but may have spectral tilt or even a stronger deviation from flat spectral characteristics. Depending on the type of background noise, the spectra of the optimal excitation sequences may, for example, have lowpass or highpass characteristics. Because of this mismatch between the random excitation and the correct or optimal excitation, the comfort noise generated at the receive side sounds different from the background noise on the transmit side. The generated comfort noise may, for example, sound considerably "brighter" or "darker" than it should. During DTX, the spectral content of the background noise thus changes between active speech (i.e., speech coding on) and speech pauses (i.e., comfort noise generation on). This audible difference causes a reduction in the transmission quality that can be perceived by a user.
In speech coding systems, such as the full rate (FR), half rate (HR), and enhanced full rate (EFR) speech channels of the GSM system, the comfort noise parameters are transmitted at a low rate. For example, in the FR and EFR channels this rate is only once every 24 frames (i.e., every 480 milliseconds). This means that the comfort noise parameters are updated only about twice per second. This low transmission rate cannot accurately represent the spectral and temporal characteristics of the background noise and, therefore, some degradation in the quality of background noise is unavoidable during DTX.
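The update rate quoted above follows directly from the frame timing. The short sketch below simply reproduces that arithmetic, assuming the 20 millisecond GSM frame duration; the function name and its parameters are illustrative only.

```python
def sid_update_timing(frame_ms=20, frames_per_sid=24):
    """Return (interval between comfort noise updates in ms, updates per second)."""
    interval_ms = frame_ms * frames_per_sid
    updates_per_second = 1000.0 / interval_ms
    return interval_ms, updates_per_second


interval_ms, rate = sid_update_timing()
print(interval_ms, round(rate, 2))  # 480 2.08
```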
A further problem that arises during DTX in digital cellular systems, such as GSM, relates to a hangover period of a few speech frames that is introduced after a speech burst, and before the actual transmission is terminated. If the speech burst is below some threshold duration, it can be interpreted as a background noise spike, and in this case the speech burst is not followed by a hangover period. The hangover period is used for computing an estimate of the characteristics of the background noise on the transmit side to be transmitted to the receive side in a comfort noise parameter message (or Silence Descriptor (SID) frame), before the transmission is terminated. As was described above, the transmitted background noise estimate is used on the receive side to generate comfort noise with characteristics similar to the transmit side background noise at the time the transmission is terminated.
In known types of DTX mechanisms similar to those of GSM FR and HR, non-predictive comfort noise quantization schemes are employed. Due to this, the receive side does not have to know if a hangover period exists at the end of a speech burst. However, in GSM EFR, efficient predictive comfort noise quantization schemes are employed, and the existence of a hangover period is locally evaluated at the receive side to assist in comfort noise dequantization. This involves a small computational load and a number of program instructions to be executed.
Another problem arises if the background noise on the transmit side is not stationary but varies considerably. In this case there may exist a single frame or a small number of frames within an averaging period for which some or all of the speech coding parameters provide a poor characterization of the typical background noise. A similar situation may occur when a Voice Activity Detection (VAD) algorithm interprets the unvoiced end of a period of active speech as "no speech", or when the stationary background noise contains strong impulse-type noise bursts. Because of the short duration of the averaging periods in known types of DTX systems, such ill-conditioned speech coding parameters may change the result of the averaging significantly enough that the resulting averaged CN parameters do not accurately characterize the background noise. This results in a mismatch in the level, in the spectrum, or in both, between the background noise and the comfort noise. The quality of transmission is thus impaired, as the background noise sounds different to the user depending on whether it is received during speech (normal speech coding of speech and background noise) or during speech pauses (produced by comfort noise generation).
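The sensitivity of a short averaging period to a single ill-conditioned frame can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the gain values are invented, the gain is treated as an amplitude (hence the 20·log10 conversion), and a plain arithmetic mean stands in for whatever averaging a particular codec actually uses.

```python
import math


def level_mismatch_db(frame_gains, typical_gain):
    """Average the per-frame gains and report the level error in dB
    relative to the typical background noise gain (hypothetical helper)."""
    mean_gain = sum(frame_gains) / len(frame_gains)
    mismatch_db = 20.0 * math.log10(mean_gain / typical_gain)
    return mean_gain, mismatch_db


# Eight-frame averaging period: seven typical frames plus one noise burst.
gains = [100.0] * 7 + [2000.0]
mean_gain, mismatch_db = level_mismatch_db(gains, typical_gain=100.0)
print(mean_gain)  # 337.5
```

With these made-up numbers a single outlier frame raises the averaged level by more than 10 dB, which is the kind of level mismatch described above.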
In greater detail, during the DTX hangover period any frames declared by the VAD algorithm as being "no speech" frames are sent over the air interface, and the speech coding parameters are buffered to be able to evaluate the comfort noise parameters for a first SID frame. The first SID frame is transmitted immediately after the end of the DTX hangover period. The length of the DTX hangover period is thus determined by the length of the averaging period. Therefore, to minimize the channel activity of the system, the averaging period should be fixed at a relatively short length.
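The hangover behavior described above can be sketched as a small per-frame decision loop. This is a simplified illustration, not the GSM procedure: the hangover length, the burst threshold, and the action labels are hypothetical, and periodic SID updates during long pauses are omitted.

```python
def dtx_schedule(vad_flags, hangover_len=7, min_burst_len=3):
    """Map per-frame VAD decisions to transmit-side actions.

    hangover_len and min_burst_len are illustrative values only.
    """
    actions = []
    burst_len = 0        # frames in the speech burst currently in progress
    hangover_left = 0    # hangover frames still to be transmitted
    sid_pending = False  # a first SID frame is waiting to be sent
    for speech in vad_flags:
        if speech:
            burst_len += 1
            hangover_left = 0
            sid_pending = False
            actions.append("SPEECH")
            continue
        if burst_len > 0:
            # The burst has just ended; very short bursts are treated as
            # noise spikes and get no hangover period.
            if burst_len >= min_burst_len:
                hangover_left = hangover_len
            sid_pending = True
            burst_len = 0
        if hangover_left > 0:
            # "No speech" frames are still transmitted during hangover so
            # that their parameters can be buffered for averaging.
            hangover_left -= 1
            actions.append("HANGOVER")
        elif sid_pending:
            actions.append("SID")
            sid_pending = False
        else:
            actions.append("NO_TX")
    return actions


print(dtx_schedule([1, 1, 1, 1, 0, 0, 0, 0, 0], hangover_len=3))
```

The sketch makes the trade-off visible: every additional frame in the averaging period is one more frame of channel activity before the transmitter can be switched off.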
Before describing the present invention, it will be instructive to review conventional circuitry and methods for generating comfort noise parameters on the transmit side, and for generating comfort noise on the receive side. In this regard reference is thus first made to FIGS. 1a-1d.
Referring to FIG. 1a, short term spectral parameters 102 are calculated from a speech signal 100 in a Linear Predictive Coding (LPC) analysis block 101. LPC is a method well known in the prior art. For simplicity, only the case in which the synthesis filter consists solely of a short term synthesis filter is discussed herein, it being realized that in most prior art systems, such as in GSM FR, HR and EFR coders, the synthesis filter is constructed as a cascade of a short term synthesis filter and a long term synthesis filter. However, for the purposes of this description a discussion of the long term synthesis filter is not necessary. Furthermore, the long term synthesis filter is typically switched off during comfort noise generation in prior art DTX systems.
The LPC analysis produces a set of short term spectral parameters 102 once for each transmission frame. The frame duration depends on the system. For example, in all GSM channels the frame size is set at 20 milliseconds.
The speech signal is fed through an inverse filter 103 to produce a residual signal 104. The inverse filter is of the form:

$$A(z) = 1 - \sum_{i=1}^{M} a(i)\, z^{-i}$$
The filter coefficients a(i), i = 1, ..., M, are produced in the LPC analysis and are updated once for each frame. Interpolation as is known in prior art speech coding may be applied in the inverse filter 103 to obtain a smooth change in the filter parameters between frames. The inverse filter 103 produces the residual 104 which is the optimal excitation signal, and which generates the exact speech signal 100 when fed through synthesis filter 1/A(z) 112 on the receive side (see FIG. 1b). The energy of the excitation sequence is measured and a scaling gain 106 is calculated for each transmission frame in excitation gain calculation block 105.
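The relationship between the inverse filter, the residual, and the synthesis filter can be sketched in a few lines. The sketch assumes the sign convention A(z) = 1 - sum of a(i) z^-i for i = 1, ..., M, ignores interpolation and quantization, and takes the scaling gain to be the RMS value of the residual; that gain definition and all function names are assumptions made for illustration.

```python
import math


def inverse_filter(speech, a):
    """Filter by A(z): e(n) = s(n) - sum_i a(i) * s(n - i)."""
    residual = []
    for n in range(len(speech)):
        pred = sum(a[i] * speech[n - 1 - i] for i in range(min(len(a), n)))
        residual.append(speech[n] - pred)
    return residual


def synthesis_filter(excitation, a):
    """Filter by 1/A(z): y(n) = x(n) + sum_i a(i) * y(n - i)."""
    out = []
    for n in range(len(excitation)):
        pred = sum(a[i] * out[n - 1 - i] for i in range(min(len(a), n)))
        out.append(excitation[n] + pred)
    return out


def excitation_gain(residual):
    """RMS value of the residual (one assumed definition of the gain)."""
    return math.sqrt(sum(e * e for e in residual) / len(residual))


a = [0.9, -0.2]  # toy second-order coefficients giving a stable 1/A(z)
speech = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(160)]
residual = inverse_filter(speech, a)
rebuilt = synthesis_filter(residual, a)
print(max(abs(x - y) for x, y in zip(speech, rebuilt)) < 1e-9)  # True
```

Feeding the residual back through 1/A(z) recovers the input up to rounding error, which is what is meant above by the residual being the optimal excitation.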
The excitation gain 106 and short term spectral coefficients 102 are averaged over several transmission frames to obtain a characterization of the average spectral and temporal content of the background noise. The averaging is typically carried out over four frames, as is the case for the GSM FR channel, to eight frames, as is the case for the GSM EFR channel. The parameters to be averaged are buffered for the duration of the averaging period in blocks 107a and 108a (see FIG. 1d). The averaging process is carried out in blocks 107 and 108, and the average parameters that characterize the background noise are thus generated. These are the average excitation gain g_mean and the average short term spectral coefficients. In modern speech codecs, there are typically 10 short term spectral coefficients (M = 10), which are usually represented as Line Spectral Pair (LSP) coefficients f_mean(i), i = 1, ..., M, as in the GSM EFR DTX system. Although these parameters are typically quantized prior to transmission, the quantization is ignored in this description for simplicity, in that the exact type of quantization that is performed is irrelevant to an understanding of the operation of the invention as described below.
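The buffering and averaging step can be sketched as follows. This is a minimal illustration that assumes a plain arithmetic mean over the buffered frames; an actual codec may average in a different domain (for example, a logarithmic gain), and the class and method names are hypothetical.

```python
from collections import deque


class CnParameterAverager:
    """Buffers per-frame parameters and averages them over the averaging period."""

    def __init__(self, averaging_period=8):
        # Fixed-length buffers: the oldest frame drops out automatically.
        self.gains = deque(maxlen=averaging_period)
        self.lsps = deque(maxlen=averaging_period)

    def push(self, gain, lsp):
        self.gains.append(gain)
        self.lsps.append(list(lsp))

    def average(self):
        n = len(self.gains)
        g_mean = sum(self.gains) / n
        order = len(self.lsps[0])
        f_mean = [sum(frame[i] for frame in self.lsps) / n for i in range(order)]
        return g_mean, f_mean


averager = CnParameterAverager(averaging_period=4)
for k in range(1, 6):  # five frames; only the last four are kept
    averager.push(gain=float(k), lsp=[float(k), 2.0 * k])
print(averager.average())  # (3.5, [3.5, 7.0])
```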
Referring briefly to FIG. 1d, it is shown that the averaging blocks 107 and 108 each typically include the respective buffers 107a and 108a, which output buffered signals 107b and 108b, respectively, to the averaging blocks. Greater attention will be paid to the buffers 107a and 108a below when describing the embodiments of the invention shown in FIGS. 4 and 5.
The computation and averaging of the comfort noise parameters is explained in detail in GSM recommendation: GSM 06.62 "Comfort noise aspects for Enhanced Full Rate (EFR) speech traffic channels". Also by example, discontinuous transmission is explained in GSM recommendation: GSM 06.81 "Discontinuous Transmission (DTX) for Enhanced Full Rate (EFR) speech traffic channels", and voice activity detection (VAD) is explained in GSM recommendation: GSM 06.82 "Voice Activity Detection (VAD) for Enhanced Full Rate (EFR) speech channels". As such, the details of these various functions are not further discussed here.
Referring to FIG. 1b, there is shown a block diagram of a conventional decoder on the receive side that is used to generate comfort noise in the prior art speech communication system. The decoder receives the two comfort noise parameters, the average excitation gain g_mean and the set of average short term spectral coefficients f_mean(i), i = 1, ..., M, and based on the parameters the decoder generates the comfort noise. The comfort noise generation operation on the receive side is similar to speech decoding, except that the parameters are used at a significantly lower rate (e.g., once every 480 milliseconds, as in the GSM FR and EFR channels), and no excitation signal is received from the speech encoder. During speech decoding the excitation on the receive side is obtained from a codebook that contains a plurality of possible excitation sequences, and an index for the particular excitation vector in the codebook is transmitted along with the other speech coding parameters. For a detailed description of speech decoding and the use of codebooks reference can be had to, by example, U.S. Pat. No. 5,327,519, entitled "Pulse Pattern Excited Linear Prediction Voice Coder", by Jari Hagqvist, Kari Jarvinen, Kari-Pekka Estola, and Jukka Ranta, the disclosure of which is incorporated by reference herein in its entirety.
During comfort noise generation, however, no index to the codebook is transmitted, and the excitation is obtained instead from a random number or excitation (RE) generator 110. The RE generator 110 generates excitation vectors 114 having a flat spectrum. The excitation vectors 114 are then scaled by the average excitation gain g_mean in scaling unit 115 so that their energy corresponds to the average gain of the excitation 104 on the transmit side. A resulting scaled random excitation sequence 111 is then input to the speech synthesis filter 112 to generate the comfort noise output signal 113. The average short term spectral coefficients f_mean(i) are used in the speech synthesis filter 112.
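The conventional receive-side procedure can be sketched as follows. This is an illustration only: it assumes the averaged LSP coefficients have already been converted to direct-form coefficients a(i) of a synthesis filter 1/A(z) with A(z) = 1 - sum of a(i) z^-i (the conversion is omitted), it treats the gain as an RMS value, and it uses Gaussian random numbers as the flat-spectrum excitation. All names are hypothetical.

```python
import math
import random


def generate_comfort_noise(g_mean, a_mean, num_samples, seed=0):
    """Return (scaled excitation, comfort noise) for one block of samples."""
    rng = random.Random(seed)
    # Random excitation with a flat (white) spectrum.
    excitation = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    # Scale so that the RMS value of the excitation equals g_mean.
    rms = math.sqrt(sum(e * e for e in excitation) / num_samples)
    scaled = [g_mean * e / rms for e in excitation]
    # Shape the spectrum with the synthesis filter 1/A(z).
    noise = []
    for n in range(num_samples):
        pred = sum(a_mean[i] * noise[n - 1 - i] for i in range(min(len(a_mean), n)))
        noise.append(scaled[n] + pred)
    return scaled, noise


scaled, noise = generate_comfort_noise(g_mean=50.0, a_mean=[0.9, -0.2], num_samples=160)
print(len(noise))  # 160
```

Only the synthesis filter shapes the spectrum in this scheme; the excitation itself stays flat, which is the source of the mismatch discussed earlier.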
FIG. 1c illustrates the spectrum associated with the signal in different parts of the prior art decoder of FIG. 1b. The RE generator 110 produces the random number excitation sequences 114 (and the scaled excitation 111) having a flat spectrum. This spectrum is shown by curve A. The speech synthesis filter 112 then modifies the excitation to produce a non-flat spectrum as shown in curve B.
As was discussed above, a number of problems exist with respect to conventional comfort noise generation techniques. These problems include the mismatch between the random excitation and the correct or optimal excitation which results in the comfort noise generated at the receive side sounding different from the actual background noise on the transmit side. It is a goal of this invention to reduce or eliminate these problems.