This invention relates generally to the field of speech communication and, more particularly, to discontinuous transmission (DTX) and to improving the quality of comfort noise (CN) during discontinuous transmission.
Discontinuous transmission is used in mobile communication systems to switch the radio transmitter off during speech pauses. The use of DTX saves power in the mobile station and extends the operating time between battery recharges. It also reduces the general interference level and thus improves transmission quality.
However, during speech pauses the background noise which is transmitted with the speech also disappears if the channel is cut off completely. The result is an unnatural sounding audio signal (silence) at the receiving end of the communication.
It is known in the art, instead of completely switching the transmission off during speech pauses, to generate parameters that characterize the background noise, and to send these parameters over the air interface at a low rate in Silence Descriptor (SID) frames. These parameters are used at the receive side to regenerate background noise which reflects, as well as possible, the spectral and temporal content of the background noise at the transmit side. These parameters that characterize the background noise are referred to as comfort noise (CN) parameters. The comfort noise parameters typically include a subset of speech coding parameters: in particular synthesis filter coefficients and gain parameters.
It should be noted, however, that in some comfort noise evaluation schemes of some speech codecs, part of the comfort noise parameters are derived from speech coding parameters while other comfort noise parameter(s) are derived from, for example, signals that are available in the speech coder but that are not transmitted over the air interface.
It is assumed in prior art DTX systems that the excitation can be approximated sufficiently well by spectrally flat noise (i.e., white noise). In prior art DTX systems, the comfort noise is generated by feeding locally generated, spectrally flat noise through a speech coder synthesis filter. However, such white noise sequences are unable to produce high quality comfort noise. This is because the optimal excitation sequences are not spectrally flat, but may have spectral tilt or even a stronger deviation from flat spectral characteristics. Depending on the type of background noise, the spectra of the optimal excitation sequences may, for example, have lowpass or highpass characteristics. Because of this mismatch between the random excitation and the correct or optimal excitation, the comfort noise generated at the receive side sounds different from the background noise on the transmit side. The generated comfort noise may, for example, sound considerably “brighter” or “darker” than it should. During DTX, the spectral content of the background noise thus changes between active speech (i.e., speech coding on) and speech pauses (i.e., comfort noise generation on). This audible difference in the comfort noise thus causes a reduction in the transmission quality which can be perceived by a user.
In speech coding systems, such as in the full rate (FR), half rate (HR), and enhanced full rate (EFR) speech channels of the GSM system, the comfort noise parameters are transmitted at a low rate. By example, in the FR and EFR channels this rate is only once per every 24 frames (i.e., every 480 milliseconds). This means that comfort noise parameters are updated only about twice per second. This low transmission rate cannot accurately represent the spectral and temporal characteristics of the background noise and, therefore, some degradation in the quality of background noise is unavoidable during DTX.
A further problem that arises during DTX in digital cellular systems, such as GSM, relates to a hangover period of a few speech frames that is introduced after a speech burst, and before the actual transmission is terminated. If the speech burst is below some threshold duration, it can be interpreted as a background noise spike, and in this case the speech burst is not followed by a hangover period. The hangover period is used for computing an estimate of the characteristics of the background noise on the transmit side to be transmitted to the receive side in a comfort noise parameter message (or Silence Descriptor (SID) frame), before the transmission is terminated. As was described above, the transmitted background noise estimate is used on the receive side to generate comfort noise with characteristics similar to the transmit side background noise at the time the transmission is terminated.
In known types of DTX mechanisms similar to those of GSM FR and HR, non-predictive comfort noise quantization schemes are employed. Due to this, the receive side does not have to know if a hangover period exists at the end of a speech burst. However, in GSM EFR, efficient predictive comfort noise quantization schemes are employed, and the existence of a hangover period is locally evaluated at the receive side to assist in comfort noise dequantization. This involves a small computational load and a number of program instructions to be executed.
Another problem arises if the background noise on the transmit side is not stationary but varies considerably. In this case there may exist a single frame or a small number of frames within an averaging period for which some or all of the speech coding parameters provide a poor characterization of the typical background noise. A similar situation may occur when a Voice Activity Detection (VAD) algorithm interprets the unvoiced end of the period of active speech as “no speech”, or when the stationary background noise contains strong impulse-type noise bursts. Because of the short duration of the averaging periods in known types of DTX systems, such ill-conditioned speech coding parameters may change the result of the averaging significantly enough that the resulting averaged CN parameters do not accurately characterize the background noise. This results in a mismatch either in the level or in the spectrum, or both, between the background noise and the comfort noise. The quality of transmission is thus impaired as the background noise sounds different to the user depending on whether it is received during speech (normal speech coding of speech and background noise) or during speech pauses (produced by comfort noise generation).
In greater detail, during the DTX hangover period any frames declared by the VAD algorithm as being “no speech” frames are sent over the air interface, and the speech coding parameters are buffered to be able to evaluate the comfort noise parameters for a first SID frame. The first SID frame is transmitted immediately after the end of the DTX hangover period. The length of the DTX hangover period is thus determined by the length of the averaging period. Therefore, to minimize the channel activity of the system, the averaging period should be fixed at a relatively short length.
Before describing the present invention, it will be instructive to review conventional circuitry and methods for generating comfort noise parameters on the transmit side, and for generating comfort noise on the receive side. In this regard reference is thus first made to FIGS. 1a-1d. 
Referring to FIG. 1a, short term spectral parameters 102 are calculated from a speech signal 100 in a Linear Predictive Coding (LPC) analysis block 101. LPC is a method well known in the prior art. For simplicity, only the case where the synthesis filter consists solely of a short term synthesis filter is discussed herein, it being realized that in most prior art systems, such as in GSM FR, HR and EFR coders, the synthesis filter is constructed as a cascade of a short term synthesis filter and a long term synthesis filter. However, for the purposes of this description a discussion of the long term synthesis filter is not necessary. Furthermore, the long term synthesis filter is typically switched off during comfort noise generation in prior art DTX systems.
The LPC analysis produces a set of short term spectral parameters 102 once for each transmission frame. The frame duration depends on the system. For example, in all GSM channels the frame size is set at 20 milliseconds.
The speech signal is fed through an inverse filter 103 to produce a residual signal 104. The inverse filter is of the form:

$$A(z) = 1 - \sum_{i=1}^{M} a(i)\, z^{-i} \qquad (1)$$
The filter coefficients a(i), i=1, . . . , M are produced in the LPC analysis and are updated once for each frame. Interpolation as is known in prior art speech coding may be applied in the inverse filter 103 to obtain a smooth change in the filter parameters between frames. The inverse filter 103 produces the residual 104 which is the optimal excitation signal, and which generates the exact speech signal 100 when fed through synthesis filter 1/A(z) 112 on the receive side (see FIG. 1b). The energy of the excitation sequence is measured and a scaling gain 106 is calculated for each transmission frame in excitation gain calculation block 105.
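The relationship between the inverse filter A(z), the residual, and the synthesis filter 1/A(z) can be illustrated with a minimal sketch. The function names, the zero initial filter state, and the use of the root-mean-square value as the excitation gain are assumptions made for illustration only; the text above states only that the energy of the excitation is measured and a scaling gain is calculated. Interpolation of the coefficients between frames is omitted.

```python
import math


def inverse_filter(speech, a):
    """Filter by A(z): r(n) = s(n) - sum_{i=1..M} a(i) * s(n - i).

    Samples before the start of the input are assumed to be zero.
    """
    residual = []
    for n in range(len(speech)):
        prediction = 0.0
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                prediction += a[i - 1] * speech[n - i]
        residual.append(speech[n] - prediction)
    return residual


def synthesis_filter(excitation, a):
    """Filter by 1/A(z): s(n) = e(n) + sum_{i=1..M} a(i) * s(n - i)."""
    output = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += a[i - 1] * output[n - i]
        output.append(acc)
    return output


def excitation_gain(residual):
    """Assumed gain definition: RMS value of the residual in one frame."""
    return math.sqrt(sum(r * r for r in residual) / len(residual))


# Feeding the residual through 1/A(z) reproduces the input signal.
speech = [1.0, 0.5, -0.25, 0.75]
coeffs = [0.9, -0.2]  # hypothetical a(1), a(2), so M = 2
rebuilt = synthesis_filter(inverse_filter(speech, coeffs), coeffs)
```

In this sketch `rebuilt` equals `speech` up to floating-point rounding, which is the sense in which the residual is the optimal excitation for the synthesis filter.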
The excitation gain 106 and short term spectral coefficients 102 are averaged over several transmission frames to obtain a characterization of the average spectral and temporal content of the background noise. The averaging is typically carried out over four frames, as is the case for the GSM FR channel, to eight frames, as is the case for the GSM EFR channel. The parameters to be averaged are buffered for the duration of the averaging period in blocks 107a and 108a (see FIG. 1d). The averaging process is carried out in blocks 107 and 108, and the average parameters that characterize the background noise are thus generated. These are the average excitation gain gmean and the average short term spectral coefficients. In modern speech codecs, there are typically 10 short term spectral coefficients (M=10) which are usually represented as Line Spectral Pair (LSP) coefficients fmean(i), i=1, . . . , M, as in the GSM EFR DTX system. Although these parameters are typically quantized prior to transmission, the quantization is ignored in this description for simplicity, in that the exact type of quantization that is performed is irrelevant to an understanding of the operation of the invention as described below.
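The averaging over the buffered frames can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and a plain arithmetic mean is assumed for both the gain and the spectral coefficients, whereas an actual codec may average in a different domain and quantizes the result.

```python
def average_cn_parameters(gain_buffer, lsp_buffer):
    """Average buffered per-frame parameters over one averaging period.

    gain_buffer: one excitation gain per buffered frame.
    lsp_buffer:  one list of M spectral coefficients per buffered frame.
    Returns (g_mean, f_mean) using an assumed arithmetic mean.
    """
    frames = len(gain_buffer)
    g_mean = sum(gain_buffer) / frames
    order = len(lsp_buffer[0])
    f_mean = [sum(frame[i] for frame in lsp_buffer) / frames
              for i in range(order)]
    return g_mean, f_mean


# Two buffered frames with M = 2 coefficients each (hypothetical values).
g_mean, f_mean = average_cn_parameters([1.0, 3.0],
                                       [[0.25, 0.5], [0.75, 1.0]])
```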
Referring briefly to FIG. 1d, it is shown that the averaging blocks 107 and 108 each typically include the respective buffers 107a and 108a, which output buffered signals 107b and 108b, respectively, to the averaging blocks. Greater attention will be paid to the buffers 107a and 108a below when describing the embodiments of the invention shown in FIGS. 4 and 5.
The computation and averaging of the comfort noise parameters is explained in detail in GSM recommendation: GSM 06.62 “Comfort noise aspects for Enhanced Full Rate (EFR) speech traffic channels”. Also by example, discontinuous transmission is explained in GSM recommendation: GSM 06.81 “Discontinuous Transmission (DTX) for Enhanced Full Rate (EFR) for speech traffic channels”, and voice activity detection (VAD) is explained in GSM recommendation: GSM 06.82 “Voice Activity Detection (VAD) for Enhanced Full Rate (EFR) speech channels”. As such, the details of these various functions are not further discussed here.
Referring to FIG. 1b, there is shown a block diagram of a conventional decoder on the receive side that is used to generate comfort noise in the prior art speech communication system. The decoder receives the two comfort noise parameters, the average excitation gain gmean and the set of average short term spectral coefficients fmean(i), i=1, . . . , M, and based on the parameters the decoder generates the comfort noise. The comfort noise generation operation on the receive side is similar to speech decoding, except that the parameters are used at a significantly lower rate (e.g., once every 480 milliseconds, as in the GSM FR and EFR channels), and no excitation signal is received from the speech encoder. During speech decoding the excitation on the receive side is obtained from a codebook that contains a plurality of possible excitation sequences, and an index for the particular excitation vector in the codebook is transmitted along with the other speech coding parameters. For a detailed description of speech decoding and the use of codebooks reference can be had to, by example, U.S. Pat. No. 5,327,519, entitled “Pulse Pattern Excited Linear Prediction Voice Coder”, by Jari Hagqvist, Kari Jarvinen, Kari-Pekka Estola, and Jukka Ranta, the disclosure of which is incorporated by reference herein in its entirety.
During comfort noise generation, however, no index to the codebook is transmitted, and the excitation is obtained instead from a random number or excitation (RE) generator 110. The RE generator 110 generates excitation vectors 114 having a flat spectrum. The excitation vectors 114 are then scaled by the average excitation gain gmean in scaling unit 115 so that their energy corresponds to the average gain of the excitation 104 on the transmit side. A resulting scaled random excitation sequence 111 is then input to the speech synthesis filter 112 to generate the comfort noise output signal 113. The average short term spectral coefficients fmean(i) are used in the speech synthesis filter 112.
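The conventional receive-side processing described above (random excitation, gain scaling, synthesis filtering) can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: the function name is hypothetical, Gaussian random numbers stand in for the RE generator 110, the excitation is normalized to unit RMS before scaling, and the averaged spectral parameters are assumed to have already been converted from LSP form to direct-form coefficients a(i).

```python
import math
import random


def generate_comfort_noise(g_mean, a_mean, frame_len, rng):
    """Generate one frame of conventional comfort noise.

    g_mean:    average excitation gain (treated here as a target RMS).
    a_mean:    direct-form coefficients a(1..M) of the synthesis filter
               1/A(z); conversion from LSPs is assumed done elsewhere.
    frame_len: number of samples to generate.
    rng:       a random.Random instance acting as the RE generator.
    Returns (scaled_excitation, comfort_noise).
    """
    # Spectrally flat random excitation.
    excitation = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]

    # Scale so that the excitation RMS equals the average gain.
    rms = math.sqrt(sum(e * e for e in excitation) / frame_len)
    scaled = [g_mean * e / rms for e in excitation]

    # Synthesis filter 1/A(z) with zero initial state.
    noise = []
    for n in range(frame_len):
        acc = scaled[n]
        for i in range(1, len(a_mean) + 1):
            if n - i >= 0:
                acc += a_mean[i - 1] * noise[n - i]
        noise.append(acc)
    return scaled, noise


# 160 samples corresponds to 20 ms at an assumed 8 kHz sampling rate.
scaled, noise = generate_comfort_noise(2.0, [0.5], 160, random.Random(0))
```

Because the excitation is spectrally flat, the spectrum of the output in this sketch is determined entirely by the synthesis filter, which is the limitation discussed above.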
FIG. 1c illustrates the spectrum associated with the signal in different parts of the prior art decoder of FIG. 1b. The RE-generator 110 produces the random number excitation sequences 114 (and the scaled excitation 111) having a flat spectrum. This spectrum is shown by curve A. The speech synthesis filter 112 then modifies the excitation to produce a non-flat spectrum as shown in curve B.
As was discussed above, a number of problems exist with respect to conventional comfort noise generation techniques. These problems include the mismatch between the random excitation and the correct or optimal excitation which results in the comfort noise generated at the receive side sounding different from the actual background noise on the transmit side. It is a goal of this invention to reduce or eliminate these problems.
It is thus a first object and advantage of this invention to provide an improved method for generating comfort noise during discontinuous transmission, and to minimize a loss of signal quality due to the use of discontinuous transmission.
It is a further object and advantage of this invention to provide improved comfort noise generation methods that are able to better characterize background noise, and that further provide an improved quality of comfort noise and an improved quality of transmission during discontinuous transmission.
It is another object and advantage of this invention to provide an enhanced comfort noise generation technique that eliminates or minimizes the generation of non-representative comfort noise, and which employs a reduced averaging time.
The foregoing and other problems are overcome and the objects and advantages of the invention are realized by methods and apparatus in accordance with embodiments of this invention, wherein an improved method for generating comfort noise (CN) in discontinuous transmission (DTX) is provided.
The invention provides an improved method for comfort noise generation, in which the random excitation is modified by a spectral control filter so that the frequency content of comfort noise and background noise become similar.
In accordance with the teaching of this invention the conventional random excitation with flat spectral distribution is not used as the excitation during comfort noise generation. Instead the random excitation is suitably modified so that the comfort noise more accurately characterizes the spectrum of the background noise that is present on the transmit side of the communication. This results in an improved quality of comfort noise.
Steps of the method of this invention include calculating random excitation spectral control (RESC) parameters on the transmit side. On the receive side, the spectral control parameters are used to modify the random excitation so that the spectral content of the generated or produced comfort noise matches more accurately that of the actual background noise at the transmit side. The random excitation spectral control (RESC) parameters are calculated during speech pauses, together with the rest of the comfort noise parameters, and are then transmitted to the receive side.
In accordance with a method of this invention, a first step calculates random excitation spectral control (RESC) parameters on the transmit side. These parameters are transmitted to the receive side together with other CN-parameters. On the receive side, the RESC-parameters are used for shaping the spectral content of excitation prior to applying it to the synthesis filter.
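The idea of shaping the random excitation before the synthesis filter can be illustrated with a deliberately simple sketch. The form of the spectral control filter and of the RESC parameters is not specified in this passage, so everything below is a hypothetical stand-in: a single tilt parameter `beta` and a first-order filter y(n) = x(n) + beta * x(n - 1) are assumed purely for illustration. A positive `beta` emphasizes low frequencies and a negative `beta` emphasizes high frequencies.

```python
def shape_excitation(excitation, beta):
    """Apply a hypothetical first-order spectral control filter.

    y(n) = x(n) + beta * x(n - 1), with x(-1) assumed to be zero.
    beta = 0 leaves the (flat) excitation unchanged.
    """
    shaped = []
    previous = 0.0
    for x in excitation:
        shaped.append(x + beta * previous)
        previous = x
    return shaped


# A constant (low-frequency) input is boosted by a positive beta,
# while an alternating (high-frequency) input is attenuated.
low = shape_excitation([1.0, 1.0, 1.0], 0.5)
high = shape_excitation([1.0, -1.0, 1.0], 0.5)
```

In such an arrangement the shaped excitation, rather than the flat excitation, would be scaled and applied to the synthesis filter. Any change in excitation energy introduced by the shaping would also have to be accounted for in the gain scaling, which this sketch does not attempt.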
Further in accordance with this invention all or a predetermined number of ill-conditioned speech coding parameters within an averaging period are removed, or replaced by applying a median replacement method, when the parameters are averaged. In this embodiment of the invention steps are executed of measuring the distances of the speech coding parameters from each other between individual frames within an averaging period, ordering these parameters according to the measured distances, finding the parameters which have the largest distances to the other parameters within the averaging period, and, if the distances exceed a predetermined threshold, replacing these parameters with a parameter which has a smallest measured distance (i.e., a median value) to the other parameters within the averaging period. The median valued parameter is considered to have a value which is the most faithful representation of the characteristics of the background noise among the parameters within the averaging period. After this procedure, the averaging of the speech coding parameters may be performed in any desired manner. Furthermore, the teaching of this embodiment of the invention does not change the way in which the CN parameters are received and used on the receive side of the DTX system.
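The median replacement steps described above can be sketched as follows. The sketch makes assumptions that the text does not fix: the function name is hypothetical, the distance between two frames is taken as the sum of squared differences of their parameters, and the distance of a frame "to the other parameters" is taken as the sum of its distances to every other frame in the averaging period.

```python
def median_replace(frames, threshold, max_replacements):
    """Replace ill-conditioned parameter vectors with the median vector.

    frames:           one parameter vector (list of floats) per frame.
    threshold:        a frame is replaced only if its total distance
                      to the other frames exceeds this value.
    max_replacements: predetermined upper limit on replaced frames.
    Returns a new list of frames; the input is not modified.
    """
    def distance(u, v):
        # Assumed distance measure: sum of squared differences.
        return sum((x - y) ** 2 for x, y in zip(u, v))

    count = len(frames)
    # Total distance of each frame to all other frames.
    totals = [sum(distance(frames[i], frames[j])
                  for j in range(count) if j != i)
              for i in range(count)]

    # Order frame indices from smallest to largest total distance.
    order = sorted(range(count), key=lambda i: totals[i])
    median_index = order[0]

    result = [list(f) for f in frames]
    # Examine the frames with the largest distances first.
    for index in order[::-1][:max_replacements]:
        if index != median_index and totals[index] > threshold:
            result[index] = list(frames[median_index])
    return result


# The fourth frame is an outlier (hypothetical single-parameter frames).
cleaned = median_replace([[1.0], [1.1], [0.9], [5.0]],
                         threshold=20.0, max_replacements=2)
```

In this example only the outlier frame exceeds the threshold, so it alone is replaced by the median frame. The cleaned frames can then be averaged in any desired manner.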
In addition to removing the ill-conditioned CN parameters from the averaging period, and thereby improving the comfort noise quality, this embodiment of the invention provides other advantages. For example, prior art DTX systems require a longer averaging period in order to reduce the effect of the ill-conditioned parameters on the averaging. The use of this invention beneficially allows a shorter averaging period than in prior art DTX systems, since the effect of the ill-conditioned parameters on the averaging operation is reduced. Also, prior art DTX systems require a longer hangover period due to the longer averaging period, thereby increasing the channel activity. The shorter averaging period made possible by this embodiment of the invention thus also enables the DTX hangover period to be reduced, and thereby reduces channel activity. Furthermore, in prior art DTX systems, due to the longer averaging period employed, a significant amount of static memory is required by the CN averaging algorithm. A further advantage of the shortened averaging period achieved by this invention is a reduction in the amount of static memory required by the CN averaging algorithm.