One purpose of noise suppression or speech enhancement in a mobile telephone terminal is to reduce the impact of environmental noise on a speech signal and thus to improve the quality of communication. In the case of an up-link (transmission, TX) signal, it is also desired to minimise detrimental effects in the speech coding process caused by this noise.
In face-to-face communication, acoustic background noise disturbs a listener and makes it more difficult to understand speech. Intelligibility is improved by a speaker raising his or her voice so that it is louder than the background noise. In the case of telephony, background noise is troublesome because there is no additional information provided by facial expressions and gestures.
In digital telephony, a speech signal is first converted into a sequence of digital samples in an analogue-to-digital (A/D) converter and then compressed for transmission using a speech codec. The term codec is used to describe a speech encoder/decoder pair. In this description, the term “speech encoder” is used to denote the encoding side of the speech codec and the term “speech decoder” is used to denote the decoding functions of the speech codec. It should be appreciated that a general speech codec may be implemented as a single functional unit, or as separate elements that implement the encoding and decoding operations.
In digital telephony, the deleterious effect of background noise can be great. This is because speech codecs are generally optimised for efficient compression and acceptable reconstruction of speech, and their performance can be impaired if noise is present in the speech signal or if errors occur in speech transmission or reception. In addition, the presence of noise can itself lead to distortion of the background noise signal when it is encoded and transmitted.
Impaired performance of a speech codec reduces both the intelligibility of the transmitted speech and its subjective quality. Distortion of the transmitted background noise signal degrades the quality of the transmitted signal, making it more annoying to listen to and rendering contextual information less recognisable by changing the nature of the background noise signal. Consequently, work in the field of speech enhancement has concentrated on studying the effect of noise on speech coding performance and producing pre-processing methods to reduce the impact of noise on speech codecs.
The problems discussed above relate to arrangements in which only one microphone is present to provide only one signal. In such arrangements a noise suppressor is provided which can interpret the one-channel signal to decide which parts of it represent underlying speech and which represent noise.
When a digital mobile terminal receives an encoded speech signal, it is decoded by the decoding part of the terminal's speech codec and supplied to a loudspeaker or earpiece for the user of the terminal to hear. A noise suppressor may be provided in the speech decoding path, after the speech decoder, in order to reduce the noise component in the received and decoded speech signal. However, in noisy conditions the performance of the speech decoder may be affected detrimentally, resulting in one or more of the following effects:
1. The speech component of the signal may sound less natural or harsh, as critical information required by the speech codec in order to correctly decode the speech signal is altered by the presence of noise.
2. The background noise may sound unnatural because codecs are generally optimised for compressing speech rather than noise. Typically this gives rise to increased periodicity in the background noise component and may be sufficiently severe to cause the loss of contextual information carried by the background noise signal.
Information about an encoded speech signal may also be lost or corrupted during transmission and reception, for example due to transmission channel errors. This situation may give rise to further deterioration in the speech decoder output, causing additional artefacts to become apparent in the decoded speech signal. When a noise suppressor is used in the speech decoding path, after a speech decoder, non-optimal performance of the speech decoder may in turn cause the noise suppressor to operate in a less than optimal manner.
Therefore special care must be taken when implementing noise suppressors intended to operate on decoded speech signals. In particular, two conflicting factors have to be balanced. If the noise suppressor provides too much noise attenuation, this may reveal the deterioration in speech quality caused by the speech codec. However, due to the intrinsic properties of typical speech codecs, which are optimised for the encoding and decoding of speech, decoded background noise can sound more annoying than the original noise signal and so it should be attenuated as much as possible. Thus, in practice, it is found that a slightly lower level of noise reduction may be optimal for decoded speech signals, compared with that which can be applied to speech signals prior to encoding.
It is generally desirable that when noise suppression is used during speech encoding and/or decoding, it should reduce the level of background noise, minimise the speech distortion caused by the noise reduction process and preserve the original nature of the input background noise.
An embodiment of a mobile terminal comprising a noise suppressor according to prior art will now be described with reference to FIG. 1. The mobile terminal and the wireless system with which it communicates operate according to the Global System for Mobile telecommunications (GSM) standard. FIG. 1 shows a mobile terminal 10 comprising a transmitting (speech encoding) branch 12 and a receiving (speech decoding) branch 14.
In the transmitting (speech encoding) branch, a speech signal is picked up by a microphone 16 and sampled by an analogue-to-digital (A/D) converter 18 and noise suppressed in a noise suppressor 20 to produce an enhanced signal. This requires the spectrum of the background noise to be estimated so that background noise in the sampled signal can be suppressed. A typical noise suppressor operates in the frequency domain. The time domain signal is first transformed to the frequency domain, which can be carried out efficiently using a Fast Fourier Transform (FFT). In the frequency domain, voice activity has to be distinguished from background noise, and when there is no voice activity, the spectrum of the background noise is estimated. Noise suppression gain coefficients are then calculated on the basis of the current input signal spectrum and the background noise estimate. Finally, the signal is transformed back to the time domain using an inverse FFT (IFFT).
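By way of illustration only, the frequency-domain processing outlined above may be sketched as follows. This is a minimal sketch under stated assumptions, not the noise suppressor 20 itself: the function names, the Wiener-like gain rule, the gain floor and the smoothing constant are all hypothetical, and a naive discrete Fourier transform stands in for the FFT, which computes the same result more efficiently.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (an FFT gives the same result faster)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform back to a real time-domain frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def update_noise_estimate(noise_power, frame, smoothing=0.9):
    """Recursively average the power spectrum; call only on non-speech frames."""
    spectrum = dft(frame)
    return [smoothing * n + (1.0 - smoothing) * abs(s) ** 2
            for n, s in zip(noise_power, spectrum)]


def suppress_frame(frame, noise_power, gain_floor=0.1):
    """Scale each frequency bin by a gain derived from the noise estimate."""
    spectrum = dft(frame)
    enhanced = []
    for bin_value, noise in zip(spectrum, noise_power):
        power = abs(bin_value) ** 2
        # Assumed Wiener-like rule: gain near 1 where the signal dominates,
        # limited below by gain_floor where the bin is mostly noise.
        if power > 0.0:
            gain = max(gain_floor, 1.0 - noise / power)
        else:
            gain = gain_floor
        enhanced.append(gain * bin_value)
    return idft(enhanced)
```

In such a sketch, update_noise_estimate would be called only for frames in which no voice activity is detected, and suppress_frame for every frame.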
The enhanced (noise suppressed) signal is encoded by a speech encoder 22 to extract a set of speech parameters, which are then channel encoded in a channel encoder 24, where redundancy is added to the encoded speech signal in order to provide some degree of error protection. The resultant signal is then up-converted into a radio frequency (RF) signal and transmitted by a transmitting/receiving unit 26. The transmitting/receiving unit 26 comprises a duplex filter (not shown) connected to an antenna to enable both transmission and reception to occur.
A noise suppressor suitable for use in the mobile terminal of FIG. 1 is described in published document WO97/22116.
In order to lengthen battery life, different kinds of input signal-dependent low power operation modes are typically applied in mobile telecommunication systems. These arrangements are commonly referred to as discontinuous transmission (DTX). The basic idea in DTX is to discontinue the speech encoding/decoding process in non-speech periods. DTX is also intended to limit the amount of data that is transmitted over the radio link during pauses in speech. Both measures tend to reduce the amount of power consumed by the transmitting device. Typically, some kind of comfort noise signal, intended to resemble the background noise at the transmitting end, is produced as a replacement for actual background noise. DTX handlers are well known in the art, for example in the GSM Enhanced Full Rate (EFR), Full Rate and Half Rate speech codecs.
Referring again to FIG. 1, the speech encoder 22 is connected to a transmission (TX) DTX handler 28. The TX DTX handler 28 receives an input from a voice activity detector (VAD) 30 which indicates whether there is a voice component in the noise suppressed signal provided as the output of the noise suppressor block 20. The VAD 30 is basically an energy detector. It receives a filtered signal, compares the energy of the filtered signal with a threshold and indicates speech whenever the threshold is exceeded. Therefore, it indicates whether each frame produced by the speech encoder 22 contains noise with speech present or noise without speech present. The most significant difficulty in detecting speech in a signal generated by a mobile terminal is that the environments in which such terminals are used often lead to low speech/noise ratios. The accuracy of the VAD 30 is improved by using filtering to increase the speech/noise ratio before the decision is made as to whether speech is present.
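The energy comparison described above may be illustrated by the following minimal sketch. It assumes that each frame has already been filtered and is supplied as a list of samples. The function names and the fixed threshold are hypothetical and are not taken from any particular VAD.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of (already filtered) samples."""
    return sum(s * s for s in frame) / len(frame)


def detect_speech(frames, threshold):
    """Return one boolean per frame: True where the energy exceeds the threshold."""
    return [frame_energy(frame) > threshold for frame in frames]
```

For example, with a threshold of 0.01, a frame of samples of amplitude 0.01 would be classified as noise and a frame of samples of amplitude 0.5 as speech.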
Of all the environments in which mobile telephones are used, the worst speech/noise ratios are generally encountered in moving vehicles. However, if the noise is relatively stationary for extended periods, that is, if the noise amplitude spectrum does not vary much in time, it is possible to use an adaptive filter with suitable coefficients to remove much of the vehicle noise.
The noise levels in environments where mobile terminals are used may change constantly. The frequency content (spectrum) of the noise may also change, and can vary considerably depending on circumstances. Because of these changes, the threshold and adaptive filter coefficients of the VAD 30 must be constantly adjusted. To provide reliable detection, the threshold must be sufficiently above the noise level to avoid noise being falsely identified as speech, but not so far above it that low level parts of speech are identified as noise. The threshold and the adaptive filter coefficients are only up-dated when speech is not present. Of course, it is not prudent for the VAD 30 to up-date these values on the basis of its own decision about the presence of speech. Therefore, this adaptation only occurs when the signal is substantially stationary in the frequency domain, but does not have the pitch component inherent in voiced speech. A tone detector is also used to prevent adaptation during information tones.
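The gated adaptation described above may be sketched as follows. This is an illustration under stated assumptions: the stationarity, pitch and tone decisions are supplied as ready-made booleans, and the margin and adaptation rate are hypothetical values chosen only for the example.

```python
def adapt_threshold(threshold, energy, stationary, pitched, tone,
                    margin=4.0, rate=0.25):
    """Move the threshold toward margin * energy, but only on safe frames."""
    if not stationary or pitched or tone:
        # Possible speech or an information tone: freeze adaptation.
        return threshold
    target = margin * energy  # keep the threshold above the noise level
    return threshold + rate * (target - threshold)
```

A larger margin reduces the chance of noise being identified as speech, at the cost of classifying more low level speech as noise.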
A further mechanism is used to ensure that low level noise (which is often not stationary over long periods) is not detected as speech. In this case, an additional fixed threshold is used so that input frames having frame power below the threshold are interpreted as noise frames.
A VAD hangover period is used to eliminate mid-burst clipping of low level speech. Hangover is only added to speech-bursts which exceed a certain duration to avoid extending noise spikes. Operation of a voice activity detector in this regard is known in the art.
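The hangover behaviour may be illustrated by the following sketch, which post-processes a sequence of raw per-frame VAD decisions. The function name, the minimum burst length and the hangover length are hypothetical and are chosen only for the example.

```python
def apply_hangover(raw_flags, min_burst=3, hangover=2):
    """Extend speech bursts of at least min_burst frames by hangover frames."""
    output = []
    burst = 0      # length of the current run of raw speech frames
    remaining = 0  # hangover frames still to be added
    for flag in raw_flags:
        if flag:
            burst += 1
            if burst >= min_burst:
                remaining = hangover  # burst is long enough to earn hangover
            output.append(True)
        else:
            burst = 0
            if remaining > 0:
                remaining -= 1
                output.append(True)   # hangover frame: still treated as speech
            else:
                output.append(False)
    return output
```

With these values, a three-frame burst is extended by two frames, whereas an isolated single-frame spike is not extended.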
The output of the VAD 30 is typically a binary flag which is used in the TX DTX handler 28. If speech is detected in a signal, its transmission continues. If speech is not detected, transmission of the noise suppressed signal is stopped until speech is detected again.
In most mobile telecommunication systems, DTX is applied mainly in the up-link connection, since speech encoding and transmission typically consume much more power than reception and speech decoding, and because the mobile terminal typically relies on the limited energy stored in its battery. During periods in which there is no transmission of a signal supposedly carrying speech, comfort noise is generated to give the listener an illusion that the signal is, in fact, continuous. As described in further detail below, in some cellular telephone systems, comfort noise is generated in the receiving terminal, on the basis of information received from the transmitting terminal describing the characteristics of the noise at the transmitting terminal.
Generally, an explicit flag is provided in the speech decoder indicating whether the DTX operation mode is on or not. This is the case with, for example, all of the GSM speech codecs. Other cases exist, however, for example Personal Digital Cellular (PDC) networks, where a frame repeating mode must be activated in the noise suppressor by comparing input frames to earlier ones and setting up a voice operated switch (VOX) flag if consecutive frames are identical. Furthermore, in a mobile-to-mobile connection, no information is provided in the down-link connection about the occurrence of DTX in the up-link connection.
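The frame comparison used where no explicit DTX flag is available may be sketched as follows. This is a simplified illustration in which a frame is represented as a list of values and the VOX flag is raised whenever a frame is identical to its predecessor. The function name is hypothetical.

```python
def vox_flags(frames):
    """Return one flag per frame: True where the frame repeats the previous one."""
    flags = []
    previous = None
    for frame in frames:
        flags.append(previous is not None and list(frame) == list(previous))
        previous = frame
    return flags
```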
In some speech codecs, such as the GSM EFR codec, the decision to switch off transmission during pauses in speech is made in a DTX handler of the speech encoder. At the end of a speech burst, the DTX handler uses a few consecutive frames to generate a silence descriptor (SID) frame which is used to carry comfort noise parameters describing estimated background noise characteristics to the decoder. A silence descriptor (SID) frame is characterised by an SID code word.
After transmission of an SID frame, radio transmission is cut and a speech flag (SP flag) is set to zero. Otherwise, the SP flag is set to 1 to indicate radio transmission. The SID frame is received by the speech decoder, which then generates noise with a spectral profile corresponding to the properties described in the SID frame. Occasional SID frame updates are transmitted to the decoder to maintain a correspondence between the background noise at the transmitting terminal and the comfort noise generated in the receiving terminal. For example, in a GSM system, a new SID frame is sent once every 24 frames of normal transmission. Providing occasional SID frame updates in this way not only enables the generation of acceptably accurate comfort noise, but also significantly reduces the amount of information that must be transmitted over the radio link. This reduces the bandwidth required for transmission and aids efficient use of radio resources.
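The scheduling of SID frames during pauses in speech may be illustrated by the following sketch. It is a simplification under stated assumptions: the first non-speech frame after a speech burst is sent as an SID frame, without the few consecutive frames used in practice to compute its parameters, and the update interval is a parameter whose default follows the figure of 24 frames given above. The frame type labels and the function name are hypothetical.

```python
def tx_dtx_schedule(vad_flags, sid_interval=24):
    """Map per-frame VAD flags to "SPEECH", "SID" or "NO_TX" frame types."""
    schedule = []
    since_sid = None  # frames since the last SID; None while speech is active
    for speech in vad_flags:
        if speech:
            schedule.append("SPEECH")
            since_sid = None
        elif since_sid is None or since_sid + 1 >= sid_interval:
            schedule.append("SID")    # first SID after speech, or a periodic update
            since_sid = 0
        else:
            schedule.append("NO_TX")  # radio transmission is cut
            since_sid += 1
    return schedule
```

With an interval of three frames, for example, one speech frame followed by five non-speech frames gives the sequence SPEECH, SID, NO_TX, NO_TX, SID, NO_TX.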
In the receiving (speech decoding) branch 14 of the mobile terminal, an RF signal is received by the transmitting/receiving unit 26 and down-converted from RF to a base-band signal. The base-band signal is channel decoded by a channel decoder 32. If the channel decoder detects speech in the channel decoded signal, the signal is speech decoded by a speech decoder 34.
The mobile terminal also comprises a bad frame handling unit 38 to handle bad (i.e. corrupted) frames. A bad traffic frame is flagged by the Radio Sub-System (RSS) by setting a Bad Frame Indication (BFI) to 1. If errors occur in the transmission channel, normal decoding of lost or erroneous speech frames would give rise to a listener hearing unpleasant noises. To deal with this problem, the subjective quality of lost speech frames is typically improved by substituting bad frames with either a repetition or an extrapolation of a previous good speech frame or frames. This substitution provides continuity of the speech signal and is accompanied by a gradual attenuation of the output level, resulting in silencing of the output within a rather short period. A good traffic frame is flagged by the radio subsystem with a BFI of 0.
An embodiment of a prior art bad frame handling unit 38 is located in the Receive (RX) Discontinuous Transmission (DTX) handler. The bad frame handling unit carries out frame substitution and muting when the radio sub-system indicates that one or more speech or Silence Descriptor (SID) frames have been lost. For example, if SID frames are lost, the bad frame handling unit notifies the speech decoder of this fact and the speech decoder typically replaces a bad SID frame with the last valid one. This frame is repeated and gradually attenuated just as in the case of a repeated speech frame, in order to provide continuity to the noise component of the signal. Alternatively, an extrapolation of a previous frame is used rather than a direct repetition.
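Frame substitution with gradual attenuation may be sketched as follows. This is an illustration under stated assumptions and not the procedure of any particular codec: frames are represented as lists of output samples, a bad frame is replaced by a scaled repetition of the last good frame, and the attenuation factor applied for each consecutive bad frame is a hypothetical value.

```python
def conceal_bad_frames(frames, bfi_flags, attenuation=0.5):
    """Replace frames flagged BFI = 1 with an attenuated copy of the last good frame."""
    output = []
    last_good = None
    gain = 1.0
    for frame, bfi in zip(frames, bfi_flags):
        if bfi == 0:
            last_good = list(frame)
            gain = 1.0                 # a good frame resets the attenuation
            output.append(list(frame))
        else:
            gain *= attenuation        # each further bad frame is quieter
            source = last_good if last_good is not None else [0.0] * len(frame)
            output.append([gain * s for s in source])
    return output
```

After several consecutive bad frames the gain approaches zero, which corresponds to the silencing of the output described above.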
The purpose of frame substitution is to conceal the effect of lost frames. The purpose of attenuating the output when several frames are lost is to indicate the possible breakdown of the radio link (channel) to the user and to avoid generating possibly annoying sounds, which may result from the frame substitution procedure. However, substitution and attenuation of the usually uninformative background noise in the lost frames affects the perceived quality of the noisy speech or the pure background noise. Even at rather low levels of background noise, rapid attenuation of the background noise in lost frames leads to an impression of a badly decreased fluency of the transmitted signal. This impression becomes stronger if the background noise is louder.
The signal produced by the speech decoder, whether decoded speech, comfort noise or repeated and attenuated frames, is converted from digital to analogue form by a digital-to-analogue converter 40 and then played through a speaker or earpiece 42, for example to a listener.